I've got a PHP RSS feed. A lot of domains are using my RSS feed for news, and I'd like to be able to track which domains they are. I tried $_SERVER['HTTP_REFERER'] (note the spelling; there is no 'http_referrer' key) to no avail.
Perhaps if you link an image in your feed, clients will load it and send a Referer header you can look for.
You could, of course, point that image at a script which doesn't really serve a visible image but tracks the traffic.
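A minimal sketch of such a script, assuming a hypothetical pixel.php referenced from an <img> tag inside each feed item; it logs the Referer header and still returns a 1x1 transparent GIF:

<?php
// pixel.php - hypothetical tracking endpoint, referenced from an <img> in the feed
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : 'unknown';
$ip      = isset($_SERVER['REMOTE_ADDR'])  ? $_SERVER['REMOTE_ADDR']  : 'unknown';

// Append one line per hit; the log path is an assumption, use somewhere writable
file_put_contents('/var/log/feed-hits.log', date('c') . "\t$ip\t$referer\n", FILE_APPEND);

// Serve a 1x1 transparent GIF so the image request still succeeds
header('Content-Type: image/gif');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');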
$_SERVER["REMOTE_ADDR"] is the best you can do to find out the client's IP address. That is not identical to the domain of the site that a possible bot would be working for, though, and will not tell you in what ways your content is re-used.
One thing you could do is attach a "?from=feed" flag to any links that point to your site from the feed. That way, you could at least tell how many visitors come to your site through your feed. The referer variable will then contain the site the link was published on. This is pretty accurate but of course works only if people click the links.
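A sketch of the receiving side, assuming the feed links carry ?from=feed as described: on the landing page, log the Referer's host whenever the flag is present.

<?php
// On a landing page: record the referring domain for visits tagged ?from=feed
if (isset($_GET['from']) && $_GET['from'] === 'feed') {
    $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    $host    = $referer ? (parse_url($referer, PHP_URL_HOST) ?: 'unknown') : 'direct';
    // Log path is an assumption; one line per tagged visit
    file_put_contents('/var/log/feed-referers.log', date('c') . "\t" . $host . "\n", FILE_APPEND);
}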
Have you tried your web server logs? You could parse them and filter for all lines listing access to the feed resource.
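For example, a rough sketch that tallies user agents for feed requests in a combined-format access log; feed fetchers rarely send a Referer, so the user-agent field is usually the more telling one (the feed path and log location are assumptions):

<?php
// Tally the user agents of everything that requested the feed
$agents = array();
foreach (file('/var/log/apache2/access.log') as $line) {
    if (strpos($line, 'GET /feed.php') === false) continue;
    // Combined log format ends with: "referer" "user-agent"
    if (preg_match('/"[^"]*" "([^"]*)"$/', trim($line), $m)) {
        $ua = $m[1];
        $agents[$ua] = isset($agents[$ua]) ? $agents[$ua] + 1 : 1;
    }
}
arsort($agents);
print_r($agents);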
I have written a PHP-based blog for the company I work for, not using any frameworks. I am having trouble tracking users who come to my blog (not WordPress) from my Facebook page's posts.
I have created a shortlink URL. Let's say it is sample.co, and it redirects traffic to sample.com. Everything seems fine until here; the problem starts now.
I am logging every visitor's IP and user agent. But even if I get 500 visits, my code adds something like 3,000. Facebook stats and Analytics show similar numbers (~500 visits). I see that the IPs added to MySQL are all different. It usually happens with Android users. I have read somewhere that Facebook sometimes renders the actual URL to its users when it shows the post; I mean, instead of the widget, Facebook loads the whole page. I am not quite sure about that, to be honest.
To work around this, I added a jQuery script to my page that listens for the user's scroll event. It worked great: no more inflated traffic. But now the problem is that I am counting fewer users. Even when I get 500 users from Facebook and Analytics shows similar results, my script adds only 200-300 to MySQL.
Does anyone know a better way to track real traffic? Or are you aware of this problem?
Thanks
The traffic should be filtered on the basis of the user agent; see the sketch after these links.
https://developers.facebook.com/docs/sharing/webmasters/crawler
how to detect search engine bots with php?
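A minimal sketch of that filter, assuming you only need to skip Facebook's crawler and a few common bots before inserting the visit into MySQL (the list is illustrative, not exhaustive):

<?php
// Return true when the user agent looks like a known crawler
function is_bot($ua) {
    $bots = array('facebookexternalhit', 'Facebot', 'Googlebot', 'bingbot', 'Twitterbot');
    foreach ($bots as $bot) {
        if (stripos($ua, $bot) !== false) {
            return true;
        }
    }
    return false;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!is_bot($ua)) {
    // ... insert the visit into MySQL here
}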
Identifying users by IP is a good idea, but since IPs can change, it's a good idea to use cookies as well.
http://php.net/manual/en/function.uniqid.php
If the cookie does not exist, you should treat it as a new user.
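A sketch of that cookie approach using uniqid() (the cookie name and lifetime are arbitrary); note that setcookie() must run before any output:

<?php
// Count a visitor only once per cookie lifetime; must run before any output
if (!isset($_COOKIE['visitor_id'])) {
    $id = uniqid('', true); // pseudo-unique id, good enough for counting
    setcookie('visitor_id', $id, time() + 86400 * 30, '/'); // 30 days
    // ... treat as a new user: insert into MySQL here
}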
I have found the answer. The problem is called preview (prefetch). Here is the link:
https://www.facebook.com/business/help/1514372351922333
Simply put, Facebook preloads everything when it shows the thumbnail to the visitor, to speed up your page's load time. It sends an "X-Purpose: preview" header, so you can simply check whether the HTTP_X_PURPOSE header's value is "preview" or not. If so, do not count it as a visitor.
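As a sketch; the header shows up in PHP as $_SERVER['HTTP_X_PURPOSE'] (and, per the Inchoo articles below, some Facebook fetches identify themselves with an "X-FB-HTTP-Engine: Liger" header instead):

<?php
// Skip Facebook's prefetch/preview requests when counting visitors
$purpose = isset($_SERVER['HTTP_X_PURPOSE']) ? $_SERVER['HTTP_X_PURPOSE'] : '';
$engine  = isset($_SERVER['HTTP_X_FB_HTTP_ENGINE']) ? $_SERVER['HTTP_X_FB_HTTP_ENGINE'] : '';

if (strcasecmp($purpose, 'preview') !== 0 && strcasecmp($engine, 'Liger') !== 0) {
    // ... real visitor: count it
}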
Here are more detailed descriptions:
http://inchoo.net/dev-talk/mitigating-facebook-x-fb-http-engine-liger/
http://inchoo.net/dev-talk/magento-website-hammering-facebook-liger/
I am working on a project which needs to extract data from a website by parsing its HTML and getting the content out of the title tag and the meta description. I am able to parse that data from a normal website, but in this case the website can only be accessed using an IP address as the URL. Is it possible to extract the data, and what solution can be used?
A URL doesn't need a domain name; something like http://127.0.0.1/test.php is a valid URL, and every scraper should handle it correctly.
This does require the website to respond to requests on the IP-based URL. Sites on private servers or very big sites might do so; sites from ordinary shared hosters usually don't, as those hosts serve multiple sites from the same IP.
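A minimal sketch with DOMDocument (the IP is a placeholder); if the server virtual-hosts several sites, you may also need to send the right Host header, e.g. via a stream context or cURL:

<?php
// Fetch a page by IP-based URL and pull out <title> and the meta description
$html = file_get_contents('http://203.0.113.10/'); // placeholder IP

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from real-world markup

$title = $doc->getElementsByTagName('title')->item(0);
echo 'Title: ' . ($title ? $title->textContent : 'n/a') . "\n";

foreach ($doc->getElementsByTagName('meta') as $meta) {
    if (strcasecmp($meta->getAttribute('name'), 'description') === 0) {
        echo 'Description: ' . $meta->getAttribute('content') . "\n";
    }
}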
Okay, so when you post a link on Facebook, it does a quick scan of the page to find images and text etc. to create a sort of preview on their site. I'm sure other social networks such as Twitter do much the same, too.
Anyway, I created a sort of "one time message" system, but when you create a message and send the link in a chat on Facebook, it probes the page and the message gets marked as "seen".
I know that the Facebook probe has a user agent of facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php), so I could just block all requests from anything with that user agent, but I was wondering if there's a more efficient way of achieving this with all sites that "probe" links for content?
No, there's no fool-proof way to do this. The easiest way to achieve something like this is to manually block certain visitors from marking the content as seen; a sketch of that follows below.
Every entity on the web identifies itself with a user agent. Although not every non-human entity identifies itself in a unique way, there are online databases (like this one) that can help you achieve your goal.
As for trying to block all bots via robots.txt: not every bot holds to that standard. I would speculate that Facebook visits any shared link to prevent malware from being spread across their network.
You could try something like this in your robots.txt file:
User-agent: *
Disallow: /
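If you'd rather handle it in the application than rely on robots.txt, here's a sketch that skips the "seen" update for self-identified bots (the pattern list is illustrative, not exhaustive):

<?php
// Don't burn the one-time message for link-preview probes
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isProbe = preg_match('/facebookexternalhit|Facebot|Twitterbot|Slackbot|WhatsApp|bot|crawl|spider/i', $ua);

if (!$isProbe) {
    // ... mark the message as seen here
}
// Probes still get a page back, but the message survives for the human reader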
I am writing a program in PHP that requests a domain name from the user, and from this the user will be displayed with social networking information (likes, shares, comments etc) about the domain.
I have tried using several techniques, including;
"https://graph.facebook.com/v2.1/?id=www.ebay.co.uk"
Although this gets a response, it seems to have given me Facebook information regarding a single page (the home page) instead of the entire site.
Is there a way for me to find the Facebook page of a company when I only have their domain name available?
Any help would be appreciated! Cheers!
The Open Graph Protocol is not hierarchical, but a flat structure.
Each page with an og: tag is treated as a completely distinct element.
Is there a way for me to find the Facebook page of a company when I only have their domain name available?
You can use the og:site_name tag if it's available on the website.
Take tutsplus.com as an example: they have the og:site_name tag with "tuts+" as the value. When searching Facebook for a fan page named "Tuts+", you will find facebook.com/tutsplus as the first result.
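A sketch of reading that tag (the URL is just the example above):

<?php
// Read the og:site_name meta tag from a site's homepage
$html = file_get_contents('https://tutsplus.com/');

$doc = new DOMDocument();
@$doc->loadHTML($html); // real-world markup triggers warnings otherwise

foreach ($doc->getElementsByTagName('meta') as $meta) {
    if ($meta->getAttribute('property') === 'og:site_name') {
        echo $meta->getAttribute('content'); // e.g. "Tuts+"
        break;
    }
}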
Sorry for the long and perhaps confusing title. I'm asking for advice or guidance on how I can get an RSS feed from a page that does not have RSS enabled by default. But that is not the problem itself. The problem is that the page asks me to enter a username and password. So here's the thing...
PROBLEM:
Get an RSS feed out of a forum which does not have RSS enabled, where we need to be logged in to see the 'news'.
POSSIBLE SOLUTIONS that come to mind:
1. There are several websites which offer services in English to generate an RSS feed from pages that don't have one. That's fine, but these sites don't offer an option to log in with a username and password on the page I want the info from, so these types of sites are excluded.
2. I tried logging in via the URL itself, so I could feed that URL to the sites from item 1, by passing the username and password as variables directly in the URL, something like www.forosinrss/login.php?usuario=me&password=mypassword, but the forum bounces me, telling me I'm not sending the correct data. Another problem is that the password is MD5-hashed, which also prevents me from logging in through the URL.
3. I tried "SELECT * FROM the Internet", in other words YQL. But that came to almost nothing: I found no way to submit a username and password, nor to generate the cookie the forum wants.
I need suggestions, recommendations, tips or complaints.
Download the page using something like cURL (or fsockopen, if you're feeling brave), then transform the page from HTML to RSS using XSLT stylesheets.
Once upon a time I wrote an app in PHP to do this with ok-ish results:
use curl to get the page and keep a copy
run a custom filter regular expression to select the bit of the page that actually matters (some sites have dynamic text like ads or just displaying the current date and time)
after a timeout, use curl to get the page again and run the same filter on it
run diff old_page, new_page and pipe the result into an rss template
The system worked OK, but filtering each page down to the content I wanted a feed from was fiddly, and it broke a lot, because these kinds of sites are often hand-edited, so you can't guarantee any consistency.
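A rough sketch of that loop; the forum URL, the login (assumed to have been done beforehand with a cURL POST writing a cookie jar), and the filter pattern are all assumptions, and diff is shelled out to the system utility:

<?php
// Fetch the page with a logged-in session (cookie jar written by an earlier
// cURL login POST), filter it, and diff against the previous snapshot.
$ch = curl_init('http://forum.example/news.php'); // assumed URL
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEFILE     => '/tmp/forum-cookies.txt', // jar from the login step
));
$page = curl_exec($ch);
curl_close($ch);

// Keep only the part of the page that matters (pattern is site-specific)
preg_match('#<div id="news">(.*?)</div>#s', $page, $m);
$content = isset($m[1]) ? $m[1] : '';

$old = @file_get_contents('/tmp/forum-snapshot.html');
file_put_contents('/tmp/forum-snapshot.html', $content);

if ($old !== false && $old !== $content) {
    // Something changed: wrap the raw diff in a minimal RSS item
    file_put_contents('/tmp/old.html', $old);
    file_put_contents('/tmp/new.html', $content);
    $diff = shell_exec('diff /tmp/old.html /tmp/new.html');
    echo "<item><title>Forum update</title><description>"
       . htmlspecialchars($diff) . "</description></item>\n";
}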