How to detect fake URLs with PHP - php

I'm working on a script that indexes and downloads a whole website from a user-submitted URL.
For example, when a user submits a domain like http://example.com, I copy all the links on the index page, download the pages they point to, and then start over from the first.
I do this part with cURL and regular expressions to download the pages and extract the links.
However, some websites generate fake URLs: if you go to http://example.com?page=12, it has links to http://example.com?page=12&id=10, http://example.com?page=13, and so on.
This creates a loop, and the script can never finish downloading the site.
Is there any way to detect these kinds of pages?
P.S.: I think Google, Yahoo, and other search engines face this kind of problem too, but their databases are clean and their search results don't show this kind of data.

Some pages may use GET variables and be perfectly valid (as you've mentioned here, ?page=12 and ?page=13 may both be acceptable). So what I believe you're actually looking for is a way to identify unique pages.
It's not possible, however, to detect these straight from the URL. ?page=12 may point to exactly the same thing as ?page=12&id=1 does; it may not. The only way to tell is to download the page, compare the download to pages you've already got, and as a result find out whether it really is one you haven't seen yet. If you have seen it before, don't crawl its links.
Minor side note here: Make sure you block websites from a different domain, otherwise you may accidentally start crawling the whole web :)
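A minimal sketch of both ideas in PHP, assuming cURL is available (all variable names are illustrative, and the regex link extraction mirrors the simplified approach from the question):

    <?php
    // Crawl a site, skipping pages whose content we've already seen and
    // links that point to a different domain. Illustrative sketch only.

    $startUrl   = 'http://example.com';
    $host       = parse_url($startUrl, PHP_URL_HOST);
    $queue      = [$startUrl];
    $seenUrls   = [];
    $seenHashes = [];

    while ($url = array_shift($queue)) {
        if (isset($seenUrls[$url])) {
            continue;               // already downloaded this exact URL
        }
        $seenUrls[$url] = true;

        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 10,
        ]);
        $body = curl_exec($ch);
        curl_close($ch);
        if ($body === false) {
            continue;
        }

        // The key step: if identical content was already fetched under a
        // different URL (?page=12 vs ?page=12&id=1), don't crawl its links.
        $hash = md5($body);
        if (isset($seenHashes[$hash])) {
            continue;
        }
        $seenHashes[$hash] = true;

        // Simplified regex extraction; relative URLs would need to be
        // resolved against $url before being enqueued.
        preg_match_all('/href="(https?:\/\/[^"]+)"/i', $body, $m);
        foreach ($m[1] as $link) {
            if (parse_url($link, PHP_URL_HOST) === $host) {
                $queue[] = $link;   // same domain only
            }
        }
    }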

Related

How to mask a URL having many subdirectories and files?

I have a website with many directories and files. I want to hide all the subdirectory and file names, so that https://example.com/folder_01/file.php appears as https://example.com. I was able to hide a single folder name using a rewrite rule in .htaccess on an Apache server. I also tried the frameset approach, but the browser flags the site as running an unsafe script. Can anyone help me?
Thanks in advance.
This isn't possible.
A URL is how the browser asks the server for something.
If you want different things, then they need different URLs.
If what you desired was possible, then somehow the server would have to know that if my browser asked for / then it meant "The picture of the cat" while also knowing that if my browser asks for / then it means "The picture of the dog".
It would be like stopping at a fast food drivethru where you had never been before, didn't know anyone who worked there, and asking for "My usual" and expecting them to be able to know what that was.
You mentioned using frames, which is an old and very horrible hack that will keep a constant URL displayed in the browser's address bar but has no real effect beyond making life more difficult for the user.
They can still look at the Network tab in their browser's developer tools to see the real URL.
They can still right click a link and "Open in new tab" to escape the frames.
Links from search engines will skip right past the frames and index the URLs of the underlying pages which have actual content.
URLs are a fundamental part of the WWW. Don't try to break them. You'll only hurt your site.

scraperwiki: why does my scraper work for one URL but not another?

This is my first scraper https://scraperwiki.com/scrapers/my_first_scraper_1/
I managed to scrape google.com but not this page.
http://subeta.net/pet_extra.php?act=read&petid=1014561
Any reasons why?
I have followed the documentation from here.
https://scraperwiki.com/docs/php/php_intro_tutorial/
And there is no reason why the code should not work.
It looks like your code searches for one specific element. Elements differ from site to site, so if the element you are looking for doesn't exist on a page, you get nothing back. I would also look into creating your own scraping/spidering tool with cURL. Not only will you learn a lot, but you will also find out a lot about how sites can be scraped.
As a side note, you might want to consider abiding by the robots.txt file of the website you are scraping, or ask for permission before scraping, as scraping without it is considered impolite.
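If you do roll your own with cURL, a minimal fetch-and-parse sketch might look like this (the User-Agent string and the XPath query are placeholders you would adapt to the target page):

    <?php
    // Fetch a page with cURL and parse it with DOMDocument instead of
    // assuming a fixed element exists. Sketch only; adapt as needed.

    $url = 'http://subeta.net/pet_extra.php?act=read&petid=1014561';

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Some sites refuse requests without a browser-like User-Agent.
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; MyScraper/1.0)',
    ]);
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false) {
        die("Fetch failed\n");
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html);          // suppress warnings from messy HTML
    $xpath = new DOMXPath($doc);

    // Grab the page title as a smoke test; replace the query with
    // whatever element you actually want to extract.
    foreach ($xpath->query('//title') as $node) {
        echo trim($node->textContent), "\n";
    }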

Identify a file that contains a particular string in a PHP/SQL site

By using the inspect element feature of Chrome I have identified a string of text that needs to be altered to lower case.
Though the string appears on all the pages in the site, I am not sure which file to edit.
The website is a CMS based on PHP and SQL; I am not very familiar with these technologies.
I have searched through the files manually and cannot find the string.
Is there a way to search for and identify the file I need using, for example, the inspect element feature in a browser or an FTP tool such as FileZilla?
Check if you have a layout page of any kind in your CMS. If you do, then most probably either in that file or in the footer include file you will find either the JavaScript for Google Analytics or a JS include file for the same.
Try doing a site-wide search for 'UA-34035531-1' (which is your Google Analytics key) and see if it returns anything. If you find it, what you need should be a couple of lines below it.
People don't usually put analytics code in the database, so there is a bigger chance you will find it in one of the files, most probably one included/embedded in a layout file of some sort, since it needs to appear on every page of the site.
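If you can run a script on the server, a small PHP helper can do that site-wide search for you (a sketch; the document-root path is a placeholder):

    <?php
    // Recursively search a directory tree for files containing a string.
    // Illustrative helper; point it at your CMS document root.

    function findFilesContaining($dir, $needle) {
        $hits = [];
        $files = new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
        );
        foreach ($files as $file) {
            if ($file->isFile()
                && strpos(file_get_contents($file->getPathname()), $needle) !== false) {
                $hits[] = $file->getPathname();
            }
        }
        return $hits;
    }

    // Locate the file(s) holding the analytics key mentioned above.
    print_r(findFilesContaining('/path/to/cms', 'UA-34035531-1'));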

Can a PHP script that blocks old browsers from accessing a website also block search engine spiders?

I was looking for a way to block old browsers like IE 6.0 from accessing the contents of a page, because the page isn't compatible with them, and to return a message saying the browser is outdated and needs to be upgraded to view the page.
I know a bit of PHP, and writing a little script for this purpose isn't hard, but just as I was about to start, a big question popped into my mind.
If I write a PHP script that blocks browsers based on their name and version, is it possible that it could also block some search engine spiders?
I was thinking of doing the browser identification via this function: http://php.net/manual/en/function.get-browser.php
A crawler will probably be identified as a crawler, but could a crawler supply some kind of browser name and version instead?
If nobody has tested this before, I probably won't risk it; or I'll make a test folder inside a website to see if its pages get indexed, and if they don't, I'll abandon the idea or modify it until it works. But to save myself the trouble, I figured it would be best to ask around, because I couldn't find this information after a lot of searching.
No, it shouldn't affect any of the major crawlers. get_browser() relies on the User-Agent string sent with the request, so it shouldn't be a problem for crawlers, which use custom user-agent strings (e.g., Google's spiders have "Google" in their names).
Now, I personally think it's a bit unfriendly to completely block a website for someone on IE. I'd just put a red banner at the top saying "This site might not function correctly. Please update your browser or get a new one", or something to that effect.
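As a rough sketch of that banner approach, using a simple User-Agent test (the version cutoff and markup are examples; get_browser() would also work if browscap.ini is configured):

    <?php
    // Show a warning banner to old IE instead of blocking the page.
    // Crawlers send their own User-Agent strings (e.g. "Googlebot"),
    // so they will never match the old-IE pattern below.

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    if (preg_match('/MSIE [1-6]\./', $ua)) {
        echo '<div style="background:#c00;color:#fff;padding:8px">'
           . 'This site might not function correctly in your browser. '
           . 'Please update your browser or get a new one.'
           . '</div>';
    }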

How to show HTML pages instead of Flash to search engines

Let's say I have a plain HTML website. More than 80% of my visitors are usually from search engines like Google, Yahoo, etc. What I want to do is to make my whole website in Flash.
However, search engines can't read information from Flash or JavaScript, which means my web page would lose more than half of its visitors.
So how do I show HTML pages instead of Flash to the search engines?
Note: you can reach a specific page/category/etc. in Flash by using PHP GET variables; for example, you can surf through all the web pages from the homepage and link to a specific web page by typing page?id=1234.
Short answer: don't make your whole site in Flash.
Longer answer: If you show humans one view and the googlebot another, you are potentially guilty of "cloaking". If the Google Gods find you guilty, you will be banned to the Supplemental Index, never to be heard from again.
Also, doing an entire site in Flash breaks the basic contract of the web, namely that you can link to specific content from other sites or in emails. If your site has just one URL and everything else is handled inside of Flash ... well, I don't know what you have, but it isn't a website anymore. Adobe may like you, but many people will not. Oh, and Flash is very unfriendly to people with handicaps.
I recommend using Flash where it is needed (videos, animations, etc.), but make it part of an honest-to-God website.
"What I want to do is to make my whole website in Flash"
"So how to accomplish this: show HTML pages instead of Flash?"
These two seem a bit contradictory.
It's important to understand the reasoning behind choosing Flash to build your entire website.
"More than 80 percent of my visitors are usually from search engines"
You did some analysis, but did you look at how many visitors access your website via a mobile device? Because apart from the SEO issue, Flash won't run on the majority of those devices.
Have you considered HTML5 as an alternative for anything you want to do with Flash?
Facebook requires you to build applications in Flash, among other technologies, rather than plain HTML. Why? I don't know, but that is their policy, and there has to be a reason.
I have recently been developing simple social applications in Flash (*.swf). My latest app is a website in Flash that displays in a tab of my company's Facebook page; at the same time, I also want to use that website as a regular webpage for my company on the internet. The only way I could find to display HTML text within a Flash file is to change the text properties, wherever possible, in CHARACTER to "Render text as HTML" (look for the "<>" symbol). I think that way the search engines will be able to read your content and process your website accordingly. Good luck.
As you say, you can reach a specific Flash page via GET variables such as a page ID, which is good. I hope you will embed the Flash in each HTML page. Besides that, you can include all the other HTML content in hidden form, so the crawlers can reach the content while your site displays in Flash.
Since no one actually gave you a straight answer (probably because your question is absolutely face-palm-esque), I'll try:
Consider using the web-development approach called progressive enhancement. Now, it's fair to say that it probably wasn't intended for the Flashification of a website, but you can make use of its principles.
Start with your standard HTML version of your website
Introduce swfobject to dynamically (important bit) swap out the HTML content for its Flash equivalent
Introduce swfaddress to allow for deep linking into your Flash movies (pseudo-URLs)
Granted, steps 2 and 3 are a little more advanced than how I've described them, and your site's size/structure/design may not suit this approach, but at least it's an answer.
All that being said, I agree with the other answers/comments questioning the need to display your entire site in Flash - there are very, very few reasons anyone would do that, and there are more reasons than those already given not to (iOS devices, etc.)...
