Logging/tracking in PHP: Scribe, Chukwa, log4php?

This is probably a pretty high-level question that requires a lot of explaining, but I'm in need of a lot of explaining.
Basically I'm developing a PHP application that requires a lot of logging and tracking. Tracking clicks, interactions, performance, etc. etc. Anything under the sun. Facebook's Scribe and Yahoo's Chukwa are both great implementations of this. I know little about log4php.
What I want is a high-level overview of how this kind of logging works, specifically in conjunction with a PHP application. You can stop at the point where the log gets processed; I already know that I want to use Hadoop/Hive for processing and storage.
I'd also like some fairly low-level looks at what happens within the application itself. For example, how does one take the behavior of a click and send that to the logger? I'd appreciate any reading that can help get me started, as well.

You can buy/get the tools to do this for you or build in-house.
buy/get:
1 - Tag your pages with Google/Yahoo analytics - This will track pageviews, page flow performance, SEO ranking for keywords, etc.
2 - For tracking and logging user behavior, which includes clicks, interactions, and performance, I found nothing better than ClickTale - http://www.clicktale.com/default_e.aspx - it video-records user sessions and stores these "log files" on a server.
in-house:
1 - Creating hidden fields in your forms that submit to a logging database also works (see the sketch below). You assign unique IDs to forms and keep track of their actions during submits.
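A minimal sketch of that idea, assuming a PDO connection ($pdo), a form_log table and session-based user IDs; all names here are made up for the example:

<!-- In the form: a hidden field carrying a unique form ID -->
<input type="hidden" name="form_id" value="signup_step_1">

<?php
// On submit: record which form was used, by whom, and when.
// $pdo, the form_log table and the session key are assumptions for this sketch.
$stmt = $pdo->prepare(
    'INSERT INTO form_log (form_id, user_id, submitted_at) VALUES (:f, :u, NOW())'
);
$stmt->execute([
    ':f' => isset($_POST['form_id']) ? $_POST['form_id'] : 'unknown',
    ':u' => isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null,
]);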
I'm sure there's lots more, but these are the basics. These are not PHP specific though.
HTH
EDIT #1 :
This may be beyond the scope of your question, but tracking doesn't necessarily mean data that stays in-house. An example would be adding a "like it" or "digg it" button to articles or pages. This will "log" popularity for you, and you can go to Facebook or digg.com to see your site's progress. It'll also help with SEO. Basically, it's a tracking system, and it's easy to use: there are PHP snippets out there that you can copy and paste into your code. If you have WordPress, there is a plugin - just look for "digg" or "like it" in the plugin search section.
Going back to Google Analytics: if you want to go beyond tracking clicks, go ahead and set up goals/funnels. It'll track user behavior and answer questions such as "What were my most valuable keywords?", "Where are all my users dropping off?", "What is the bounce rate for each page?", and "What are the top 3 entry points to my site and from what traffic medium?" These are the questions SEO/SEM managers are most concerned about, and they're definitely good to track and understand.
ClickTale starts where Google Analytics ends. GA will describe user behavior at the page level, but not at the field level. ClickTale, which has heat maps, will answer questions like: "I know this page has a high bounce rate, but why? Which field is a problem field for my customers?", "In what area of the page do users spend most of their time?", and "How do I prove to the graphics guys that a particular section needs to be redesigned?".
EDIT #2
For high-traffic sites, you will need to scale your logging DB. It really helps when it comes to reporting. What I suggest is a 3-tier database reporting structure: tier 1 = last 7 days, tier 2 = last 6 months, tier 3 = everything. You can modify these according to the business. The point is that data moves from one tier to another, keeping fresh data readily available. You want to generate reports as quickly as possible, and a single huge DB just doesn't scale.

You can monitor user clicks by logging the path the user is taking, referrer --> new uri, assuming both are verbose and descriptive enough. For example, if a user clicks on one of his friends you should log the uris:
Referrer: /users/41251
Target: /users/66257
storing them properly for easy querying and reporting. Here, a direct click like that implies the target appears on the referrer's page, and is therefore a friend. If you have more complicated scenarios, be sure to describe them with distinct URIs, e.g. /users/suggestion/14152 for a suggested connection.
Add to that timestamps and you have a very rough estimate of how long they stayed on each page, although users tend to lose focus, switch tabs/applications and come back, etc. Google Analytics, for one, does this well.
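As a rough sketch (assuming a PDO connection and an illustrative click_log table), logging the referrer, target and timestamp per request could look like this:

<?php
// Log referrer -> target per request with a timestamp.
// $pdo, the click_log table and the session key are illustrative assumptions.
$stmt = $pdo->prepare(
    'INSERT INTO click_log (user_id, referrer, target, created_at)
     VALUES (:user, :ref, :target, NOW())'
);
$stmt->execute([
    ':user'   => isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null,
    ':ref'    => isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : null,
    ':target' => $_SERVER['REQUEST_URI'],
]);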
For a summary of where users click most on your site using heatmaps I like the free (GPL) Clickheat.

Check out Splunk

On the frontend where you're doing the logging from, here is some sample PHP code that you might find useful:
http://www.alphadevx.com/a/85-Logging-Messages-to-Scribe-from-PHP
In terms of the architecture, you have a lot of flexibility with Scribe. I would recommend having a local Scribe instance running on each application node, and having your application log locally to localhost. These local Scribe instances can in turn be configured to log to a central Scribe server when it is not too busy, otherwise they will continue to queue up messages locally. You actually consume your logs on the central server where they are aggregated by category.
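For reference, logging to that local Scribe instance from PHP looks roughly like the sketch below, assuming the Thrift PHP runtime and the Thrift-generated Scribe bindings are installed (the include paths and the 'clicks' category are illustrative; the article linked above walks through the full setup):

<?php
// Sketch: send one log entry to the local Scribe agent over Thrift.
// Paths and the 'clicks' category are illustrative assumptions.
$GLOBALS['THRIFT_ROOT'] = '/path/to/thrift/lib/php/src';
require_once $GLOBALS['THRIFT_ROOT'] . '/Thrift.php';
require_once $GLOBALS['THRIFT_ROOT'] . '/transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'] . '/transport/TFramedTransport.php';
require_once $GLOBALS['THRIFT_ROOT'] . '/protocol/TBinaryProtocol.php';
require_once '/path/to/scribe/gen-php/scribe.php';      // generated Scribe client

$socket    = new TSocket('localhost', 1463);             // local Scribe instance
$transport = new TFramedTransport($socket);
$protocol  = new TBinaryProtocol($transport, false, false);
$client    = new scribeClient($protocol, $protocol);

$entry = new LogEntry(array(
    'category' => 'clicks',
    'message'  => json_encode(array('uid' => 41251, 'target' => '/users/66257')),
));

$transport->open();
$client->Log(array($entry));   // the local agent queues if the central server is busy
$transport->close();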
I'm a big fan of Scribe, and I think it's well designed in so far as it has a very small memory and processor footprint and is quite easy to configure (although murder to install due to the dependencies!). It just lacks documentation.


PHP dealing with concurrency

I'm running an enterprise-level PHP application. It's a browser game with thousands of users online, on an infrastructure that my boss refuses to upgrade, and the machines are running at a 2-3 system load (yep, Linux) at all times. Anyhow, that's not the real issue. The real issue is that some users wait until the server gets loaded (prime time), then bring out their mouse clickers and click the same submit button 10-20 times, sending 10-20 requests at the same time while the server is still processing the initial request and thus before the cache and the database have been updated.
Currently I have an output variable on each request, which is valid for 2 minutes, and I have a "mutex" lock, which is basically a flag in memcache that, if found, blocks further execution of the script. But the mouse clicker sends so many requests at the same time that they run almost simultaneously, which is a big issue for me.
How are the majority of Stack Overflow folks dealing with this issue? I was thinking of flagging the cookie/session, but I think I'd hit the same problem if the server gets overloaded. Optimization is impossible: the source is 7 years old and already quite optimized, with no queries on most pages (they run off the cache) and the database only queried on certain user input, like the one I'm trying to prevent.
Yep, it's procedural code with no real objects. The machines run PHP 5, but the code itself is more like PHP 4. I know, I know, it's old, but we can't spare the resources to rewrite this whole mess, since most of the original developers who knew how everything is intertwined have left, so I'm basically patching old holes. But as far as I know this is a general issue on loaded PHP websites.
P.S: Disabling the button with javascript on submit is not an option. The real cheaters are advanced users. One of them had written a bot clicker and packed it as a Google Chrome extension. Don't ask how I dealt with that.
I would look for a solution outside your code.
I don't know which web server you use, but Apache has modules like mod_evasive, for example.
You can also limit connections per second from an IP in your firewall
I'm getting the feeling this is touching more on how to update a legacy code base than anything else. While implementing some type of concurrency would be nice, the old code base is your real problem.
I highly recommend this video which discusses Technical Debt.
Watch it, then if you haven't already, explain to your boss in business terms what technical debt is. He will likely understand this. Explain that because the code hasn't been managed well (debt paid down) there is a very high level of technical debt. Suggest to him/her how to address this by using small incremental iterations to improve things.
Limiting IP connections will only make your players angry.
I've fixed and rewritten a lot of stuff in some well-known open-source game clones with old-style code:
Well, I must say that cheating can always be avoided by executing the right queries and logic.
For example, look here: http://www.xgproyect.net/2-9-x-fixes/9407-2-9-9-cheat-buildings-page.html
Anyway, about performance: keep in mind that while a script holds the session open, it blocks all other requests using that session until it is closed. So be careful about wrapping all of your code inside the session. Also, sessions should never contain heavy data.
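To illustrate the session point: PHP keeps the session locked for as long as it is open, so reading what you need and releasing it early stops one slow request from blocking a user's other requests. A minimal sketch:

<?php
// Read what you need from the session, then release the lock early so this
// user's other requests are not blocked for the whole page generation.
session_start();
$userId = isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null;

session_write_close();   // releases the session lock; $_SESSION stays readable

// ...long-running work (queries, rendering) happens here without holding the lock...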
About scripts: in my games I have a PHP module that automatically rewrites links, adding a random ID saved in the database, a sort of CSRF protection. Human users click on the changed link, so they never notice the change, but scripts keep requesting the old link, and after a few tries they are banned!
Other scripts use the DOM, so it's easy to defeat them by inserting some useless DIVs around the page.
Edit: you can boost your app with https://github.com/facebook/hiphop-php/wiki
I don't know if there's an implementation already out there, but I'm looking into writing a cache server which has responsibility for populating itself on cache misses. That approach could work well in this scenario.
Basically you need a mechanism to mark a cache slot as pending on a miss; a read of a pending value should cause the client to sleep a small but random amount of time and retry; the pending slot is then populated, as in the traditional model, by the client that encountered the miss rather than the pending marker.
In this context, the script is the client, not the browser.
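A rough sketch of that idea in PHP with the Memcached extension; the key names, timings and retry count are assumptions for illustration, not a definitive implementation:

<?php
// First request to miss marks the slot as pending and rebuilds it; concurrent
// requests sleep a small, random amount of time and retry, as described above.
function getWithPending(Memcached $mc, $key, $rebuild, $ttl = 120)
{
    for ($attempt = 0; $attempt < 50; $attempt++) {
        $value = $mc->get($key);

        if ($value !== false && $value !== 'PENDING') {
            return $value;                          // normal cache hit
        }

        if ($value === false && $mc->add($key . ':lock', 1, 30)) {
            // We won the race: mark as pending, rebuild, store, release.
            $mc->set($key, 'PENDING', 30);
            $fresh = call_user_func($rebuild);
            $mc->set($key, $fresh, $ttl);
            $mc->delete($key . ':lock');
            return $fresh;
        }

        // Someone else is rebuilding: back off briefly and try again.
        usleep(rand(50000, 150000));
    }

    return call_user_func($rebuild);   // give up waiting and rebuild ourselves
}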

To have multiple sub-domains or multiple separate domains?

My client has a host of Facebook pages that have become very successful. In order to move away from big brother Facebook my client wishes to create a large dynamic site that incorporates the more successful parts of the Facebook empire.
One of my client's spin-off sites has been created and is getting a lot of traffic. I'm not sure exactly how much, but it hit 90 GB in a month and the allocated space needed to be increased.
In any case, my client has dreamed up a massive website with its own community, looking to put the whole community under one banner. However, I am concerned that it will get thrashed: bottlenecks, long load times, etc.
My questions:
Will a managed dedicated server be able to handle a potentially large amount of traffic?
Is it going to be better to create the various parts of the empire with their own separate hosting and domains (normal hosting or VPS), or is it better to have them all under one roof (i.e. using sub-domains)?
If they were all together, would it be better for SEO and easier to manage? Or, if they are separate, they may be quicker, but would that need some sort of Passport-style user system so people can log into any of the websites with the same user details?
What's the best way to implement a Passport-style user system? Do you connect remotely to the databases? Or run a regular cron job that updates each individual user's details on each domain? Maybe make cURL requests to the other sites with any new data?
Any other Pros/Cons to keeping all the section together or separating them?
Large sites like Facebook manage to have everything under one root. Then sites like eBay have separate domain names, but you can use the same user login across all of them.
I'm not sure what the best option is and would appreciate any guidance.
It is a very general question but to give some hints:
1. Measure, measure and measure again. Know which parts are used heavily and which are not.
2. Fix things and go back to 1.
Really: without knowing what takes a lot of time, what is used most heavily, etc., you cannot say anything useful.
VPS or dedicated servers is not the right question. You start with: what do I have to do for the users? Then: how am I going to do that (for example: in the database, in scripts, in a message queue)? And then, finally, you see how much hardware you need.
One or multiple domains doesn't really matter. Though there is one exception: if you have lots of static content, it might be worth using a CDN like Amazon's. Read, for example: http://highscalability.com/blog/2011/12/27/plentyoffish-update-6-billion-pageviews-and-32-billion-image.html, where you can read about the possibilities of a CDN.
In general, serving static content from a separate static domain is useful; most other things don't really need that. So there you could just keep it all on one domain.

What is a best practice method to log visits per page / object [closed]

Take my profile for example, or the number of views on any question on this site. What is the process of logging the number of visits per page or object on a website? I presume it includes:
Counting registered users once (this must be reflected in the db: which pages / objects the user has visited); this will also not count unregistered users.
IP: log the visit of each IP per page / object; this could be troublesome, as you might have two different people behind the same IP, or you may actually want to track repeat visitors.
Cookie: this will probably mean that people with multiple computers get counted twice.
other method goes here ....
The question is, what is the process and best practice to count user requests?
EDIT
I've added the computer languages to the list of tags as they are of interest to me. Feel free to include any libraries, modules, and/or extensions that achieve the task.
The question could be rephrased into:
How does someone go about measuring the number of imprints when a user goes on a page? The question is not intended to be about what Google Analytics does; rather, it should be something similar to when you click on a Stack Overflow question or profile and see the number of views.
The "correct" answer varies according to the situation; primarily the most desired statistic and the availability of resources to gather and process them:
eg:
Server Side
Raw web server logs
All webservers have some facility to log requests. The trouble with them is that they require a lot of processing to get meaningful data out and, for your example scenario, they won't record application-specific details, like whether or not the request was associated with a registered user.
This option won't work for what you're interested in.
File based application logs
The application programmer can add custom code to the application to record the stuff you're most interested in to a log file. This is similar to the webserver log, except that it can be application-aware and record things like the member making the request.
The programmers may also need to build scripts which extract the stuff you're most interested in from these logs. This option might be suited to a high-traffic site with lots of disk space and sysadmins who know how to ensure the logs get rotated and pruned from the production servers before bad things happen.
Database based application logs
The application programmer can write custom code for the application which records every request in a database. This makes it relatively easy to run reports and makes the data instantly accessible. This solution incurs more system overhead at the time of each request, so it is better suited to lower-traffic sites, or scenarios where the data is highly valued.
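A minimal illustration of this option (the table, columns and $pdo connection are assumptions for the sketch):

<?php
// One row per request; $pdo and the request_log table are assumptions.
$stmt = $pdo->prepare(
    'INSERT INTO request_log (member_id, uri, ip, requested_at)
     VALUES (:member, :uri, :ip, NOW())'
);
$stmt->execute([
    ':member' => isset($_SESSION['member_id']) ? $_SESSION['member_id'] : null,
    ':uri'    => $_SERVER['REQUEST_URI'],
    ':ip'     => $_SERVER['REMOTE_ADDR'],
]);

// Reporting is then a simple query, e.g. views of one page per day:
//   SELECT DATE(requested_at) AS day, COUNT(*) AS views
//   FROM request_log WHERE uri = '/some/page' GROUP BY day;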
Client Side
Javascript postback
This is a consideration on top of the above options. Google analytics does this.
Each page includes some javascript code which tells the client to report back to the webserver that the page was viewed. The data might be recorded in a database, or written to file.
It has the strong advantage of improving accuracy in scenarios where impressions get lost due to heavy caching/proxying between the client and server.
Cookies
Every time a request is received from someone who doesn't present a cookie, you assume they're new and record that hit as 'anonymous', and you return a uniquely identifying cookie after they log in. How accurate this proves depends on your application: some applications don't lend themselves to caching, so it will be quite accurate; others (high traffic) encourage caching, which will reduce the accuracy. And obviously it's not much use until they re-authenticate whenever they switch browsers/locations.
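A hedged sketch of one cookie-based variant, where the identifier is handed out on first visit rather than at login; logVisit is a hypothetical logging function and the cookie name is illustrative:

<?php
// Hand out an identifier on first visit; later requests present it back,
// letting you separate new visitors from returning ones.
session_start();

if (empty($_COOKIE['visitor_id'])) {
    $visitorId = md5(uniqid('', true));           // good enough for a counter
    setcookie('visitor_id', $visitorId, time() + 365 * 24 * 3600, '/');
    logVisit('anonymous', $visitorId, $_SERVER['REQUEST_URI']);   // hypothetical logger
} else {
    $visitorId = $_COOKIE['visitor_id'];
    $member    = isset($_SESSION['member_id']) ? $_SESSION['member_id'] : null;
    logVisit($member !== null ? $member : 'returning', $visitorId, $_SERVER['REQUEST_URI']);
}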
What's most interesting to you?
Then there's the question of what statistics are important to you. For example, in some situations you're keen to know:
how many times a page was viewed, period,
how many times a page was viewed, by a known user
how many of your known users have viewed a specific page
Thence you typically want to break it down into periods of time to see trending.
Respectively:
are we getting more views from random people?
or are we getting more views from registered users?
or has pretty much everyone who is going to see the page now seen it?
So back to your question: best practice for "number of imprints when a user goes on a page"?
It depends on your application.
My guess is that you're best off with a database backed application which records what is most interesting to your application and uses cookies to trace the member's sessions.
The best practice for a hit counter depends on how much traffic you expect your site to receive. As wybiral suggested, you can implement something that writes to a database after every request. This might include the IP address if you want to count unique visitors, or it could be a simple as just incrementing a running total for each page or for each (page, user) pair.
But that requires a database write for every request, even if you just want to serve a static page. Ideally speaking, a scalable web app should serve as much as possible from an in-memory cache. Database or disk I/O should be avoided as much as possible.
So the ideal setup would be to build up some representation of the server's activity in memory and then occasionally (say, every 15 minutes) write those events to the database. You could conceivably queue up thousands of requests and then store them with a single database write.
There's a tutorial describing how to do exactly this in python using Celery and Carrot: http://packages.python.org/celery/tutorials/clickcounter.html. It also includes some examples of how to set up your database tables using Django models and what code to call whenever someone accesses a page.
This tutorial will certainly be helpful to you regardless of what you choose to implement, although this level of architecture might be overkill if you don't expect thousands of hits each hour.
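If you stay in PHP rather than Python, the same buffer-then-flush pattern can be sketched with the phpredis extension and a cron job; the key names, table and 15-minute schedule below are assumptions for illustration:

<?php
// On every page request: a single in-memory increment, no database write.
function recordView(Redis $redis, $page)
{
    $redis->incr('views:' . $page);
}

// From a cron job (say every 15 minutes): move buffered counts into MySQL.
function flushViews(Redis $redis, PDO $db)
{
    $redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY);
    $stmt = $db->prepare(
        'INSERT INTO page_views (page, views) VALUES (:page, :views)
         ON DUPLICATE KEY UPDATE views = views + :more'
    );

    $it = null;
    while ($keys = $redis->scan($it, 'views:*')) {
        foreach ($keys as $key) {
            // getSet atomically reads the counter and resets it to 0, so
            // increments arriving during the flush are not lost.
            $count = (int) $redis->getSet($key, 0);
            if ($count > 0) {
                $stmt->execute([
                    ':page'  => substr($key, strlen('views:')),
                    ':views' => $count,
                    ':more'  => $count,
                ]);
            }
        }
    }
}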
Use a database to keep a record of the unique IPs (if the IP doesn't exist in the DB, create it; otherwise continue as planned) and then query the database for the number of those entries. Index this by IP and URL to store views for individual pages. You won't have to worry about tracking registered users this way; they will be included in the unique IP count. As far as multiple people from one IP goes, there's not much you can do there, short of requiring an account and counting user-to-page views similarly.
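A sketch of that approach with MySQL and PDO; the table layout and function names are made up for the example:

<?php
// Illustrative table:
//   CREATE TABLE page_hits (
//       ip  VARBINARY(16) NOT NULL,
//       url VARCHAR(255)  NOT NULL,
//       PRIMARY KEY (ip, url)
//   );

function recordHit(PDO $db, $ip, $url)
{
    // INSERT IGNORE: an (ip, url) pair is only stored the first time it is seen.
    $stmt = $db->prepare(
        'INSERT IGNORE INTO page_hits (ip, url) VALUES (INET6_ATON(:ip), :url)'
    );
    $stmt->execute([':ip' => $ip, ':url' => $url]);
}

function uniqueViews(PDO $db, $url)
{
    $stmt = $db->prepare('SELECT COUNT(*) FROM page_hits WHERE url = :url');
    $stmt->execute([':url' => $url]);
    return (int) $stmt->fetchColumn();
}

// e.g. recordHit($db, $_SERVER['REMOTE_ADDR'], $_SERVER['REQUEST_URI']);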
I would suggest using a persistent key/value store like Redis. If you use a list with the list key being the serialized identifier, you can store other serialized entries and use llen to find the list size.
Example (python) after initializing your Redis store:
import redis

# e.g. the store initialized earlier; adjust host/port as needed.
redisStore = redis.Redis(host='localhost', port=6379)

def initializeAndPush(serializedKey, serializedValue):
    # Only append the value if it isn't already in the list,
    # so each unique value is counted once.
    if not redisStore.exists(serializedKey):
        redisStore.rpush(serializedKey, serializedValue)
    elif serializedValue not in redisStore.lrange(serializedKey, 0, -1):
        redisStore.rpush(serializedKey, serializedValue)

def getSizeOf(serializedKey):
    if redisStore.exists(serializedKey):
        return redisStore.llen(serializedKey)
    return 0
Using this technique, you can use anything as serializedKey or serializedValue. If you want to store IPs with today's date or serialized login information, both are just as simple. Also, only unique serializedValues are stored since writes are locked on read (at least as I recall).
I would try to implement pixel tracking to track views on your page/object. This method is used by Google (Google Analytics) and other high-profile media companies.
Pixel tracking will be fine, since you can point the tracking pixel at an HttpHandler specific for that purpose. That way you can separate the load and even use some kind of queue for high-load scenarios.
Also, you can incorporate user specific information in the tracking pixel such as WHO has visited the page.
eg:
<a href="fakeimages/imba.gif?uid=123&info2=a&info3=b" style="height:1px;width:1px;" />
Then you need to handle requests going to fakeimages/*.gif with a specific HttpHandler / PHP controller (whatever language you're using) and process the info, along the lines of the sketch below.
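A minimal PHP sketch of such a handler (the log path is an assumption, and in practice you would queue or batch the writes as mentioned above):

<?php
// Handler that fakeimages/*.gif is rewritten to: log the impression, then
// return a 1x1 transparent GIF so the page renders normally.
$uid  = isset($_GET['uid']) ? $_GET['uid'] : 'anonymous';
$page = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '-';

$line = sprintf("%s\t%s\t%s\t%s\n", date('c'), $uid, $page, $_SERVER['REMOTE_ADDR']);
file_put_contents('/var/log/myapp/pixel.log', $line, FILE_APPEND | LOCK_EX);

header('Content-Type: image/gif');
header('Cache-Control: no-cache, no-store, must-revalidate');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');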
regards

Architectural advice on connecting multiple diverse sites into a single community

I've been given a task to connect multiple sites of the same client into a single network, so I would like to hear architectural advice on connecting these sites into a single community.
These sites include:
1. Invision Power Board Forum (the most important site)
2. 3 custom made cms-s (changes to code allowable)
3. 1 drupal site
4. 3-4 wordpress blogs
Requirements are as follows:
1. Connecting all users of all sites into a single administrable entity, with the ability to change permissions, ban users, etc.
2. Later on, based on this implementation, I have to implement a "Facebook-like" chat which will be available to all users regardless of where they log in.
I have a few ideas in mind on how to go about this, but I would like to hear from people with more experience and expertise than myself.
Cheers!
You're going to have one hell of a time. Each of those site platforms has a very disparate user architecture: there is no way to "connect" them all together fluidly without numerous codebase changes. You're looking at making deep changes to each of those platforms to communicate with a central database, likely modifying thousands (if not tens of thousands) of lines of code.
On top of the obvious (massive) changes to all of the platforms, you're going to have to worry about updates: what happens when a new version of Wordpress is released? You'd likely have to update all of your code manually (since you can't just drop in the changes). You'd also have to make sure that all of the code changes are compatible with your current database. God forbid one of the platforms starts storing user information differently---you'd have to make more massive code changes. This just isn't maintainable.
Your alternative (and best bet) is to have some sort of synchronization job that runs every hour or so: iterate through each user in each database and compare it to see if it both exists and is up-to-date in the other databases. If not, push the changes out. The problem with this is that it will get significantly slower as you get more and more users.
Perhaps another alternative is to simply offer a custom OpenID implementation. I believe that Drupal and Wordpress both have OpenID plugins that you can take advantage of. This way, you could allow your users to sign in with a pseudo-single sign-on service across your sites. The downside is that users could opt not to use it.
Good luck

Top techniques to avoid 'data scraping' from a website database

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection attacks, there is nothing to prevent anyone from viewing all the records as html and running some sort of script to parse this data back into another database. Even if I was to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publically available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address (a rough sketch of a simple per-IP limiter follows after this list).
Require JavaScript - to ensure the client has some resemblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course.
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robot metatags. This would stop conforming spiders. This will prevent Google from indexing you for instance:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
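As a rough illustration of the rate-limiting option above, here is a minimal per-IP limiter; the memcached backend, thresholds and key names are assumptions for the sketch:

<?php
// Count requests per IP in fixed windows; block once the limit is exceeded.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$ip     = $_SERVER['REMOTE_ADDR'];
$window = 60;     // seconds per counting window
$limit  = 120;    // max requests per IP per window
$key    = 'rate:' . $ip . ':' . floor(time() / $window);

// add() only succeeds if the key doesn't exist yet, so initialization is race-safe.
if (!$memcached->add($key, 1, $window * 2)) {
    $count = $memcached->increment($key);
    if ($count !== false && $count > $limit) {
        header('HTTP/1.1 429 Too Many Requests');
        exit('Too many requests');
    }
}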
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (i.e. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping; see the sketch below.
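A sketch of that second option; the token name and the 10-second lifetime are illustrative, and the two halves would live in the page-shell script and the AJAX endpoint respectively:

<?php
// When rendering the page shell: store a short-lived token in the session
// and embed it in the page so the AJAX call can send it back.
session_start();
$_SESSION['data_token']    = md5(uniqid('', true));
$_SESSION['data_token_ts'] = time();

// In the AJAX endpoint that actually returns the data:
$valid = isset($_SESSION['data_token'], $_SESSION['data_token_ts'], $_GET['token'])
    && $_GET['token'] === $_SESSION['data_token']
    && (time() - $_SESSION['data_token_ts']) <= 10;

if (!$valid) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
// ...otherwise fetch the records and return them as JSON.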
Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.
Force a reCAPTCHA every 10 page loads for each unique IP.
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible but not able to be scraped.
He's asking for two incongruent things and maybe having a discussion as to his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to the bulletin board's troll protection... If a scrape is detected (perhaps a certain amount of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for google IPs!
Normally, to screen-scrape a decent amount, one has to make hundreds or thousands (or more) of requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do is just to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.
