Advice needed on outsourcing the recording of unique views

I have an online image sharing platform based on PHP5 + CodeIgniter. I would like the application to show the user the number of unique views per image. Users can access image pages anonymously or signed in.
Although I know how to implement such a thing myself, I prefer to "outsource" the recording of unique pageviews, for reasons of performance and complexity (determining unique pageviews). My requirements for such a service:
Must record unique page views
Must have an API that allows me to get the #pageviews for one specific page programmatically. This goes beyond just displaying it, I may need to do calculations with this number
Semi-realtime information is good enough. Reasonable delays are acceptable
Low cost or free
My question is: Do you know of such a service and which one would you recommend or have experience with? In your answer, please assume an "outsourced" scenario, not DIY.

Piwik is a self-hosted, Open Source tool that sports an API. I have no experience with it yet, but looking at the API docs, it might be able to do what you need (no guarantees though).

Related

Facebook/Gmail-like web chatbox - what is a good way for a modern chat app to store text messages?

I'm currently building a Facebook-like chatbox, and I have encountered several considerations and problems along the way.
I have been googling for useful resources all along, such as simple chatbox examples or online tutorials.
My goal is to build one just like the Facebook/Gmail chatbox and CometChat. I know it's hard and there is a lot to scale behind the scenes, but all I want to do is build it as simply as possible and figure out how the Facebook/Gmail chatboxes implement their chat functionality.
Progress:
I have finished a Facebook-like chatbox structure: a sidebar on the right displays the online friends I can chat with, and a popup chatbox at the bottom can be expanded and minimized.
I also have finished simple chatting based on MySQL database.
There's a table with 4 columns 'sender', 'receiver', 'message', 'time' for storing conversation.
My chatbox works this way:
1. The user sends a message; my front-end JavaScript fetches the message the user typed and sends it to a PHP file on the server via Ajax.
2. The back-end PHP file stores this message in MySQL.
3. The front end calls the update function every 3 seconds to refresh the chatbox content, so that if the receiver has sent a message to the sender, it shows up in the front end's chat.
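The three steps above can be sketched roughly as follows. This is only an illustration in Python with an in-memory SQLite table standing in for MySQL; the function names and the extra auto-incrementing id column are assumptions that are not part of the original table. The id lets each poll ask only for rows it has not seen yet instead of re-reading the whole conversation.

```python
import sqlite3


def open_store():
    """Create an in-memory table with the four described columns plus an id (assumption)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "sender TEXT, receiver TEXT, message TEXT, time REAL)"
    )
    return db


def send_message(db, sender, receiver, message, now):
    """Steps 1 and 2: store one message posted by the front end."""
    with db:  # commits on success
        db.execute(
            "INSERT INTO messages (sender, receiver, message, time) "
            "VALUES (?, ?, ?, ?)",
            (sender, receiver, message, now),
        )


def poll_messages(db, receiver, last_seen_id):
    """Step 3: return only the rows this receiver has not seen yet."""
    rows = db.execute(
        "SELECT id, sender, message FROM messages "
        "WHERE receiver = ? AND id > ? ORDER BY id",
        (receiver, last_seen_id),
    ).fetchall()
    new_last_seen = rows[-1][0] if rows else last_seen_id
    return rows, new_last_seen
```

The client would remember the returned id and pass it back on the next 3-second poll, which keeps each query small even as the table grows.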
I'm not sure this is a good approach for the long run, and I'm really concerned about it.
If the number of users keeps growing, I will have to think of ways to scale it well, or my database and server will explode and front-end users might see high latency when conversations update.
Is BigTable the right way to do this if you have millions of users online?
How does Facebook store its customers' text messages or chat history in the back end?
How do chat apps like WhatsApp store their text messages?
Is it possible to let users chat directly with one another without storing state on the server?
If I want to implement chat history in my chatbox, what is a good way to do it?
I am thinking the server could create a .txt file for each conversation in its file system, with a database table column storing the file path. Is this a good and correct way to handle chat history? I know it's possible to do it this way, but I'm not sure if it's the right way.
I know this could be a huge, detailed application.
I'm not asking for a detailed implementation, but for the big picture: the concepts behind building it!
Thank you!

That's a good question and here's an attempt at answering it.
I believe you are thinking about scalability a bit too early. Your IM app might not reach the projected number of users for it to stop performing well. Consider enhancing your small product and scale as you go as much as is needed.
Disk I/O is one of the issues that you will face when scaling your web application. Storing conversations directly on disk as .txt files might not be a reliable solution.
Push your technology stack to its limits before considering changing it or switching to something else. I assume you are using a relational database for your storage (since you mentioned columns and rows, which is not a definitive indicator, but still). There are other options out there that have good benchmarking results at the expense of multiple other compromises; NoSQL (which you referred to as BigTable) is one of them. Relational databases are great and have been the industry standard for quite a long time, but there are now alternative solutions that are quite promising.
Look into NoSQL data stores such as the document-based MongoDB and CouchDB, or even Cassandra; there are many others. There is a considerable amount of information about the performance of each under specific circumstances and situations. Choose what is best for the problem at hand and not what is most fashionable or hyped.
Another option would be to outsource your scalability problems to a 3rd Party provider such as Firebase. In this situation all you have to worry about is your product and not what's happening under the hood.
Store only the data that you need and archive or dismiss what you don't.
With scalability there are generally 2 broad categories: Horizontal and Vertical scaling.
Horizontal: means adding more nodes to your system i.e. adding more server instances to handle the extra load. There are many cloud solution providers out there that make this genre of scaling very cheap and instantaneous.
Vertical: means adding more resources to the node you are currently running your app on, in addition to using specific technologies that allow you to take full advantage of your resources. This optimization happens at the level of the instance resources (i.e. CPU, RAM, disk space etc.) and of your data storage, programming language of choice, the algorithms you are using etc. You might realize that PHP and MySQL aren't the tools for this job, but that's arguable.
Read More about it here
Distributed systems architects / programmers also take advantage of other (faster) programming languages (such as C, C++ or even Java) to speed up certain tasks. Look into how you can dissect your application into smaller decoupled modules / components that can run independently. (But I'm not sure you will ever reach this stage with an IM client unless it becomes as popular as WhatsApp or Facebook chat.)
I advise you to grab and read a couple of books about scaling web applications and leveraging cloud computing. Study scalable architectures and design your application depending on your business logic based on them.
This is a very broad and complex topic, I'm sure others might have additional interesting insight on the matter.

Legal script that scrapes and indexes?

I want to create a website that scrapes certain websites (specified by me) to collect data and pricing and then offer that data as search results on my own site. So basically like a search engine, but for specific sites, indexed in a specific way. I can write this myself, but would like to know:
Is it legal? Can I grab for example, all the items off ebay, put it in a search engine and allow users to search ebay using my site?
What if I make money off this?
Are there any popular PHP scripts that already do this?
The legal aspect has been covered. I found a way around this (well, I got permission from the persons creating the content)... so the only real question is: what can I use to crawl the content, especially keeping in mind that each site will have different rules that I will have to set up? It must also be clever enough not to spider the same content twice.

Is it legal?
Yes. And no. Probably.
There isn't one set of laws covering the entire planet, and SO isn't really for legal advice, you need to find a lawyer in your jurisdiction.
My own thoughts are that you would probably be okay in most jurisdictions as long as you use only the information. So, no eBay logos, no representations that you may be associated with them and so on.
But I am not a lawyer (though I deal a lot with the US sub-species as part of my work), certainly not your lawyer, and this advice (which isn't legal advice) is worth every cent you paid for it, which is ZERO!
What if I make money off this?
Good for you :-) Make mega-bucks. But see above point.
Are there any popular PHP scripts that already do this?
That's the bit I can't answer. My experience with PHP ranges somewhere between zero and nothing.

The legality is a bit shady in this area. You should look for the presence of a robots.txt ( http://www.robotstxt.org/robotstxt.html ) file to first determine if the website welcomes web spiders.
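As a rough illustration of the robots.txt check, Python's standard library ships a parser for that format. The helper name, the crawler name "MyCrawler", and the sample rules below are made up for the example; a real crawler would download the site's actual robots.txt first.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = may_crawl(EXAMPLE_ROBOTS, "MyCrawler", "http://example.com/listing")
blocked = may_crawl(EXAMPLE_ROBOTS, "MyCrawler", "http://example.com/private/item")
```

Note that robots.txt only expresses the site's wishes towards crawlers; it does not settle the legal question discussed here.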
Also, there is a very good PHP search script called sphider ( http://www.sphider.eu/ ), you should have a look at.
EDIT:
I can't see many websites having an issue with you taking snippets of their website and then linking users onto the webpage which the content came from. However, if you plan on just taking all their content and displaying it on your own website in order to make profit, I can only assume many web sites would have an issue as they are the ones who should be profiting off the content.

1) Is it legal? Can I grab for example, all the items off ebay, put it in a search engine and allow users to search ebay using my site?
This is technically feasible. You can build a PHP script that does this quite easily. I would say that it is borderline illegal, however, because by scraping content from somebody else's site you will be using their intellectual property, their data, without permission.
2) What if I make money off this?
Then the original owners of the data are very likely to come after you, issue a cease and desist notice then sue you. An organization as large as ebay could do this without blinking.
3) Are there any popular PHP scripts that already do this?
Because of the questionable legal nature of your question, I highly doubt there are any scripts that already do this.
The correct technique for getting data from eBay and other large data providers is to use APIs, or application programming interfaces. These are special protocols and languages designed for programs to communicate with each other. This has the benefit of being significantly more efficient than page-scraping, while also being a known legal way to get data from a provider.
More information about the ebay specific API can be found here; http://developer.ebay.com/common/api/

PHP - detecting changes in external database-driven site

For a homework project, I'm creating a PHP-driven website whose main function is aggregating news about various university courses.
The main problem is this: (almost) each course has its own website. These are usually just plain HTML or built using some simple free CMS.
As a student, participating in 6-7 courses, almost every day you go through 6-7 websites checking if there are any news. The idea behind the project is that you don't have to do that, instead, you just check the aggregation site.
My idea is the following: each time a student logs in, go through his course list. For every course, get its website (recursively, like with wget), and create a hash value of it. If the hash is different than the one stored in the database, we know that the site has changed, and we notify the student.
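The hashing idea can be sketched like this. It is a minimal illustration in Python: the function names are invented, a plain dict stands in for the database, and SHA-256 is just one reasonable choice of hash. Fetching the page content is left out.

```python
import hashlib


def page_fingerprint(content: bytes) -> str:
    """Hash the downloaded page content."""
    return hashlib.sha256(content).hexdigest()


def has_changed(stored_hashes, course_url, content):
    """Compare the new hash with the stored one, then store the new hash.

    A URL that has never been seen before is reported as changed.
    """
    new_hash = page_fingerprint(content)
    changed = stored_hashes.get(course_url) != new_hash
    stored_hashes[course_url] = new_hash
    return changed
```

Only the hash needs to be kept per course, not the page itself, which suits a small storage quota. One caveat: any dynamic part of a page (a visitor counter, the current date) changes the hash even when there is no news.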
So, what do you think, is this reasonable way to achieve the functionality?
And if yes, what is (technically) the best way to go about this? I was checking php_curl, but I don't know if it can get a website recursively.
Furthermore, there's a slight problem: I have somewhat limited resources, only a few MB of quota on a public (university) server. However, if that's a big problem, I could use a separate hosting solution.
Thanks :)

Just use file_get_contents, or cURL if you absolutely have to (in case you need COOKIES).
You can use your hashing trick to check for modifications but it's not very elegant. What you want to know is when was it last changed. I doubt this information is on the website, but maybe they offer an RSS feed or some webservice or API you can use for this purpose.
Don't worry about doing recursive requests. Just make a new request each time.
"When all else fails, build a scraper"

What is a best practice method to log visits per page / object [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
Take my profile for example, or the view count of any question on this site: what is the process for logging the number of visits per page or object on a website? I presume it includes:
Counting registered users once (this must be reflected in the db, which pages / objects the user has visited). this will also not include unregistered users
IP: log the visit of each IP per page / object; this could be troublesome as you might have 2 different people checking the same website; or you really do want to track repeat visitors.
Cookie: this will probably result in that people with multiple computers would be counted twice
other method goes here ....
The question is, what is the process and best practice to count user requests?
EDIT
I've added the computer languages to the list of tags as they are of interest to me. Feel free to include any libraries, modules, and/or extensions that achieve the task.
The question could be rephrased into:
How does someone go about measuring the number of imprints when a user goes on a page? The question is not intended to be similar to what Google analytics does, rather it should be something similar to when you click on a stackoverflow question or profile and see the number of views.

The "correct" answer varies according to the situation; primarily the most desired statistic and the availability of resources to gather and process them:
eg:
Server Side
Raw web server logs
All webservers have some facility to log requests. The trouble with them is that it requires a lot of processing to get meaningful data out and, for your example scenario, they won't record application specific details; like whether or not the request was associated with a registered user.
This option won't work for what you're interested in.
File based application logs
The application programmer can add custom code to the application to record the things you're most interested in to a log file. This is similar to the webserver log, except that it can be application-aware and record things like the member making the request.
The programmers may also need to build scripts which extract from these logs the things you're most interested in. This option might be suited to a high traffic site with lots of disk space and sysadmins who know how to ensure the logs get rotated and pruned from the production servers before bad things happen.
Database based application logs
The application programmer can write custom code for the application which records every request in a database. This makes it relatively easy to run reports and makes the data instantly accessible. This solution incurs more system overhead at the time of each request so better suited to lesser traffic sites, or scenarios where the data is highly valued.
Client Side
Javascript postback
This is a consideration on top of the above options. Google analytics does this.
Each page includes some javascript code which tells the client to report back to the webserver that the page was viewed. The data might be recorded in a database, or written to file.
This has the strong advantage of improving accuracy in scenarios where impressions get lost due to heavy caching/proxying between the client and server.
Cookies
Every time a request arrives without a cookie, you assume the visitor is new and record that hit as 'anonymous'; after they log in, you return a uniquely identifying cookie. How accurate this proves depends on your application. Some applications don't lend themselves to caching, so it will be quite accurate; others (high traffic) encourage caching, which will reduce the accuracy. Obviously it's not much use when users switch browsers or locations, at least until they re-authenticate.
What's most interesting to you?
Then there's the question of what statistics are important to you. For example, in some situations you're keen to know:
how many times a page was viewed, period,
how many times a page was viewed, by a known user
how many of your known users have viewed a specific page
Thence you typically want to break it down into periods of time to see trending.
Respectively:
are we getting more views from random people?
or are we getting more views from registered users?
or has pretty much every one who is going to see the page now seen it?
So back to your question: best practice for "number of imprints when a user goes on a page"?
It depends on your application.
My guess is that you're best off with a database backed application which records what is most interesting to your application and uses cookies to trace the member's sessions.

The best practice for a hit counter depends on how much traffic you expect your site to receive. As wybiral suggested, you can implement something that writes to a database after every request. This might include the IP address if you want to count unique visitors, or it could be as simple as just incrementing a running total for each page or for each (page, user) pair.
But that requires a database write for every request, even if you just want to serve a static page. Ideally speaking, a scalable web app should serve as much as possible from an in-memory cache. Database or disk I/O should be avoided as much as possible.
So the ideal set up would be to build up some representation of the server's activity in-memory and then occasionally (say every 15 minutes) write those events to the database. You could conceivably queue up thousands of requests and then store them with a single database write.
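A minimal sketch of that buffering idea, assuming a single process and using SQLite as a stand-in database. The class, table, and column names are invented for the illustration; this is not the Celery-based approach from the tutorial mentioned in this answer. Hits accumulate in memory and are written out in one transaction when flush() is called, for example from a timer.

```python
import sqlite3
from collections import Counter


class BufferedHitCounter:
    """Accumulate hits in memory and write them to the database in batches."""

    def __init__(self, db):
        self.db = db
        self.pending = Counter()
        db.execute(
            "CREATE TABLE IF NOT EXISTS page_hits "
            "(page TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
        )

    def record(self, page):
        # No database I/O here: just bump an in-memory counter.
        self.pending[page] += 1

    def flush(self):
        # One transaction for everything collected since the last flush.
        with self.db:
            for page, count in self.pending.items():
                self.db.execute(
                    "INSERT OR IGNORE INTO page_hits (page, hits) VALUES (?, 0)",
                    (page,),
                )
                self.db.execute(
                    "UPDATE page_hits SET hits = hits + ? WHERE page = ?",
                    (count, page),
                )
        self.pending.clear()

    def total(self, page):
        row = self.db.execute(
            "SELECT hits FROM page_hits WHERE page = ?", (page,)
        ).fetchone()
        return row[0] if row else 0
```

The trade-off is that hits recorded since the last flush are lost if the process dies, and with several web server processes each one would hold its own buffer, which is where a shared queue comes in.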
There's a tutorial describing how to do exactly this in python using Celery and Carrot: http://packages.python.org/celery/tutorials/clickcounter.html. It also includes some examples of how to set up your database tables using Django models and what code to call whenever someone accesses a page.
This tutorial will certainly be helpful to you regardless of what you choose to implement, although this level of architecture might be overkill if you don't expect thousands of hits each hour.

Use a database to keep a record of the unique IPs (if the IP doesn't exist in the DB, create it, otherwise continue as planned) and then query the database for the number of those entities. Index this with IP and URL to store views for individual pages. You wont have to worry about tracking registered users this way, they will be totaled into the unique IP count. As far as multiple people from one IP, there's not much you can do there short of requiring an account and counting user->to->page-views similarly.
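For illustration, the (IP, URL) index can be expressed as a composite primary key so that the database itself ignores repeat visits. This sketch uses Python with SQLite; the table and function names are assumptions made for the example.

```python
import sqlite3


def open_views_db():
    """One row per (ip, url) pair; the primary key enforces uniqueness."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE views ("
        "ip TEXT NOT NULL, url TEXT NOT NULL, PRIMARY KEY (ip, url))"
    )
    return db


def record_view(db, ip, url):
    # INSERT OR IGNORE does nothing if this IP has already viewed this URL.
    with db:
        db.execute("INSERT OR IGNORE INTO views (ip, url) VALUES (?, ?)", (ip, url))


def unique_views(db, url):
    row = db.execute("SELECT COUNT(*) FROM views WHERE url = ?", (url,)).fetchone()
    return row[0]
```

MySQL offers INSERT IGNORE for the same purpose, so the pattern should carry over with minor syntax changes.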

I would suggest using a persistent key/value store like Redis. If you use a list with the list key being the serialized identifier, you can store other serialized entries and use llen to find the list size.
Example (python) after initializing your Redis store:
import redis

# Assumes a Redis server on localhost; decode_responses makes lrange return str.
redisStore = redis.Redis(decode_responses=True)

def initializeAndPush(serializedKey, serializedValue):
    # lrange returns [] for a missing key, so no separate exists() check is needed.
    if serializedValue not in redisStore.lrange(serializedKey, 0, -1):
        redisStore.rpush(serializedKey, serializedValue)

def getSizeOf(serializedKey):
    # llen returns 0 for a missing key.
    return redisStore.llen(serializedKey)
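Using this technique, you can use anything as serializedKey or serializedValue. If you want to store IPs with today's date or serialized login information, both are just as simple. Note that the membership check and the push are two separate commands, so two concurrent requests could in principle store the same value twice; if strict uniqueness matters, a Redis set (sadd to add, scard to count) handles the deduplication atomically.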

I would try implementing pixel tracking to track views on your page/object. This method is used by Google (Google Analytics) and other high-profile media companies.

Pixel tracking will be fine, since you can point the tracking pixel to an HttpHandler dedicated to that purpose. That way you can separate the load and even use some kind of queue for high-load scenarios.
Also, you can incorporate user specific information in the tracking pixel such as WHO has visited the page.
eg:
<img src="fakeimages/imba.gif?uid=123&info2=a&info3=b" style="height:1px;width:1px;" />
Then you need to handle the request going to fakeimages/*.gif with a specific HttpHandler / php redirect/controller (whatever language you're using) and process the infos.
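As a rough sketch of what such a handler does, here is a framework-independent Python function. The function name and the list used as a log are assumptions for the example: it extracts the query parameters from the requested pixel URL, records them, and returns a 1x1 GIF to send back to the browser.

```python
import base64
from urllib.parse import urlsplit, parse_qs

# A commonly used 1x1 transparent GIF, base64-encoded.
PIXEL_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def handle_pixel_request(request_target, log):
    """Record the tracking parameters and return (content_type, body)."""
    query = parse_qs(urlsplit(request_target).query)
    # parse_qs returns a list per key; keep the first value of each.
    params = {key: values[0] for key, values in query.items()}
    log.append(params)  # in a real setup: a queue or a database write
    return "image/gif", PIXEL_GIF
```

In a real deployment the response should also carry headers that prevent caching, otherwise repeat views may never reach the server.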
regards

Logging/tracking in PHP: Scribe, Chukwa, log4php?

This is probably a pretty high-level question that requires a lot of explaining, but I'm in need of a lot of explaining.
Basically I'm developing a PHP application that requires a lot of logging and tracking. Tracking clicks, interactions, performance, etc. etc. Anything under the sun. Facebook's Scribe and Yahoo's Chukwa are both great implementations of this. I know little about log4php.
What I want is a high-level overview of how this kind of logging works, specifically in conjunction with a PHP application. You can stop at the point where the log gets processed; I already know that I want to use Hadoop/Hive for processing and storage.
I'd also like some fairly low-level looks at what happens within the application itself. For example, how does one take the behavior of a click and send that to the logger? I'd appreciate any reading that can help get me started, as well.

You can buy/get the tools to do this for you or build in-house.
buy/get:
1 - Tag your pages with Google/Yahoo analytics - This will track pageviews, page flow performance, SEO ranking for keywords, etc.
2 - For tracking and logging user behavior, which includes clicks, interactions and performance, I found nothing better than ClickTale - http://www.clicktale.com/default_e.aspx - It video-records user sessions and puts these "log files" on a server.
in-house:
1 - Creating hidden fields in your forms that submit to a logging database also works. You assign unique IDs to forms and keep track of their actions during submits.
I'm sure there's lots more, but these are the basics. These are not PHP specific though.
HTH
EDIT #1 :
This may be beyond the scope of your question, but tracking doesn't necessarily mean data that goes in-house. An example would be adding a "like it" or "digg it" button to articles or pages. This will "log" popularity for you. You can go to Facebook or digg.com to see the progress of your site. It'll also help with SEO. Basically, it's a tracking system, and it's easy to use. There are PHP snippets out there that you can copy and paste into your code. If you have WordPress, there is a plugin - just look for "digg" or "like it" in the plugin search section.
Going back to Google Analytics, if you want to go beyond tracking clicks, go ahead and make goals/funnels. It'll track user behavior and answer questions such as "What were my most valuable keywords?", "Where are all my users dropping off?", "What is the bounce rate for each page?" and "What are the top 3 entry points to my site and from what traffic medium?" These are the questions SEO/SEM managers are most concerned about, and they are definitely good to track and understand.
ClickTale starts where Google Analytics ends. GA will describe user behavior in the page level, but not in the field level. ClickTale, which has heat maps, will answer these questions "I know this page has a high bounce rate, but why? which field is a problem field for my customers?" "At what area of the page do users spend most of their time in?" "how do i prove to the graphics guys that a particular section needs to be redesigned?".
EDIT #2
For high traffic sites, you will need to scale your logging DB. It really helps when it comes to reporting. What I suggest is a 3-tier database reporting structure: tier 1 = last 7 days, tier 2 = last 6 months, tier 3 = everything. You can modify these according to the business. The point is that data moves from one tier to another, keeping fresh data readily available. You want to generate reports ASAP. A single huge DB just doesn't scale.

You can monitor user clicks by logging the path the user is taking, referrer --> new uri, assuming both are verbose and descriptive enough. For example, if a user clicks on one of his friends you should log the uris:
Referrer: /users/41251
Target: /users/66257
storing them properly for easy querying and reporting. Here a direct click like that would assume the target is in the referrer's page, so is a friend. If you have more complicated scenarios be sure to describe them with distinct uris, eg: /users/suggestion/14152 for a suggested connection.
Add to that timestamps and you have a very rough estimate of how long they stayed on each page, although users tend to lose focus, switch tabs/applications and come back, etc. Google Analytics, for one, does this well.
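To make the timestamp idea concrete, here is a small Python sketch. The event layout and function name are assumptions made for the example: each logged click is a (timestamp, referrer, target) tuple, and the time on a page is estimated as the gap between landing on it and the next click that has it as referrer.

```python
def time_on_page(events):
    """Estimate seconds spent per page from one user's ordered click log.

    events: list of (timestamp_seconds, referrer, target) tuples.
    """
    durations = []
    for (landed_at, _, page), (left_at, referrer, _) in zip(events, events[1:]):
        # Only count the gap if the next click really started from this page.
        if referrer == page:
            durations.append((page, left_at - landed_at))
    return durations


# Hypothetical log: a user clicks a friend, then a suggested connection.
clicks = [
    (0, "/users/41251", "/users/66257"),
    (30, "/users/66257", "/users/suggestion/14152"),
]
estimates = time_on_page(clicks)
```

The last page in a session never gets an estimate, because there is no following click to close the interval.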
For a summary of where users click most on your site using heatmaps I like the free (GPL) Clickheat.

Check out Splunk

On the frontend where you're doing the logging from, here is some sample PHP code that you might find useful:
http://www.alphadevx.com/a/85-Logging-Messages-to-Scribe-from-PHP
In terms of the architecture, you have a lot of flexibility with Scribe. I would recommend having a local Scribe instance running on each application node, and having your application log locally to localhost. These local Scribe instances can in turn be configured to log to a central Scribe server when it is not too busy, otherwise they will continue to queue up messages locally. You actually consume your logs on the central server where they are aggregated by category.
I'm a big fan of Scribe, and I think it's designed well in so far as it has a very small memory and processor footprint, and it is quite easy to configure (although murder to install due to the dependencies!). It just lacks documentation.
