rewriting links in scraped content using mod_rewrite

I'm looking to create an iframe on my site that contains amazon.com, and I'd like to control it (see what product the user is at).
I realize I can't do this because of browser security policy issues, and the only real workaround is to feed the entire page through my server.
So I load the page and I change all the href values from something like
grocery-breakfast-foods-snacks-organic/b/ref=sa_menu_gro7?ie=UTF8&node=16310101&pf_rd_p=328655101&pf_rd_s=left-nav-1&pf_rd_t=101&pf_rd_i=507846&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=1S4N4RYF949Z2NS263QP
(the links on the site are relative) to 'me.com/work.php?link='.urlencode(theirlink).
The problem is the amount of time this takes. On top of that, PHP frequently runs out of memory doing it.
Could I use mod_rewrite to rewrite all URLs from:
http://www.me.com/grocery-breakfast-foods-snacks-organic/b/ref=sa_menu_gro7?ie=UTF8&node=16310101&pf_rd_p=328655101&pf_rd_s=left-nav-1&pf_rd_t=101&pf_rd_i=507846&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=1S4N4RYF949Z2NS263QP
to:
http://www.me.com/work.php?url=urlencode(thatlink)
And if not, are there any better options than going through every <a> tag?
Thanks!

Have you checked out the associates API? You can get your data that way.
https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=498&categoryID=14
http://astore.amazon.com/

Related

how to redirect all subsite urls to one single url in a multi-site and also send a variable/value to this sub-site

I have a specific requirement and am looking for suggestions on the best possible way to achieve that. I would start by apologizing if I sound too naïve. What I am trying to achieve in here is:
A) I have a parent site, say, www.abc.com.
B) I am planning to enable the multisite option for it. This parent site has an area map with a number of location images overlaid. All of these images, when clicked, should lead to a subsite.
C) This subsite (has already been coded) is totally dynamic and every single information being displayed on it is being extracted from the database. It uses a session variable, which for now has been hard-coded at the very beginning of the header. This variable also decides on which database to refer to. So it will display information for different locations, based on the location selected on the parent site. Even the URL should appear per that. Say if Location ‘A’ was clicked on parent-site then the session variable needs to set to ‘LocA’ on the sub-site and the URL should be something like www.abc.com/LocA and if the Location ‘B’ was clicked then the session variable should be set to ‘LocB’ and the URL should appear as www.abc.com/LocB etc.. Trying to figure out how to achieve this. [It will have one front-end for all the locations but different databases for each location.]
I am an entrepreneur with some programming experience from my past (but none related to website designing). Because of the help from all you geniuses and the code samples lying around, I was able to code the parent site and the sub-site (using html, php, js, css ). Now the trouble is how to put it all together and make it work in correlation. Though it will still be a week or two before I get to try it but I am trying to gather insights so that I am ready by the time I reach there. Any help will be deeply appreciated.

I think the fundamental thing to understand before you get deeper is what a URL is. A URL is not part of the content that you display to the user; nor is it the name of a file on your server. A URL is the identifier the user sends your server, which your server can use to decide what content to serve. The existence of "sub-sites", and "databases", and even "files" is completely invisible to the end user, and you can arrange them however you like; you just need to tell the server how to respond to different URLs.
While it is possible to have the same URL serve different content to different users, based on cookies or other means of identifying a user, having entire sites "hidden" behind such conditions is generally a bad idea: it means users can't bookmark that content, or share it with others; and it probably means it won't show up in search results, which need a URL to link to.
When you don't want to map URLs directly to files and folders, the common approach involves two things:
Rewrite rules, which essentially say "when the user requests URL x, pretend they requested URL y instead".
Server-side code that acts as a "front controller", looking at the (rewritten) URL that was requested, and deciding what content to serve.
As a simple example:
The user requests /abc/holidays/spain
An Apache server is configured with RewriteRule /(...)/holidays/(.*) /show-holidays.php?site=$1&destination=$2, so it rewrites the request to /show-holidays.php?site=abc&destination=spain
The show-holidays.php script looks at the parameter $_GET['site'] and loads the configuration for sub-site "abc"
It then looks at $_GET['destination'] and loads the appropriate content
The output of the PHP script is sent back to the user
If the user requests /def/holidays/portugal, they will get different content, but the same PHP script will generate it
Both the rewrite rules and the server-side script can be as simple or as complex as you like. Some sites have a single PHP script which accepts all requests, looks at the real URL that was requested, and decides what to do; others have a long list of mappings from URLs to specific PHP scripts.
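To make the rewrite-plus-front-controller flow above concrete, here is a minimal sketch in JavaScript rather than Apache and PHP. The names rewrite and frontController are hypothetical, and the pattern accepts a site name of any length, which is an assumption (the rule in the example matches exactly three characters).

```javascript
// Hypothetical stand-in for the rewrite rule:
// /<site>/holidays/<destination> -> /show-holidays.php?site=...&destination=...
const HOLIDAY_RULE = /^\/([^\/]+)\/holidays\/(.*)$/;

function rewrite(path) {
  const match = HOLIDAY_RULE.exec(path);
  if (!match) {
    return path; // URLs that do not match the rule pass through unchanged
  }
  const site = encodeURIComponent(match[1]);
  const destination = encodeURIComponent(match[2]);
  return "/show-holidays.php?site=" + site + "&destination=" + destination;
}

// Hypothetical front controller: reads the rewritten URL's parameters,
// the way the PHP script would read $_GET['site'] and $_GET['destination'].
function frontController(rewrittenUrl) {
  // The host is a placeholder; only the query string matters here.
  const url = new URL(rewrittenUrl, "http://localhost");
  return {
    site: url.searchParams.get("site"),
    destination: url.searchParams.get("destination"),
  };
}
```

Calling frontController(rewrite("/def/holidays/portugal")) goes through the same two functions but yields a different site and destination, which is the point of the pattern: one script, many URLs.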

PHP/MySQL - change the URL based on content pulled from MySQL

I have http://mysite.com/index.php.
And a sub menu
home => http://mysite.com/index.php
about us => http://mysite.com/about.us.php
products => http://mysite.com/products.php
But I want http://mysite.com/index.php to process every request and just change the content using an Ajax request. This way, the site only loads the content part, and is much faster and easier to navigate.
The problem here is SEO, because the only URL Google will see is http://mysite.com/index.php, and I would like to associate http://mysite.com/about-us with the About Us content, http://mysite.com/product with the Products content, etc.
I know I can do this with PHP just reading the URL and writing the Ajax on the fly, but doing so the whole page is going to be reloaded every time.
Is there a way to do this without reloading the whole page?
What I think I need is to have a regular anchor in the submenu, for example pointing to "http://mysite.com/contact-us", but when clicked, instead of opening this page, process the Ajax request.
And if this is possible, Google is going to see this as black hat probably, right?
Regards
Alex

Here is a solution:
window.history.pushState(data, title, url)
Here Rob explains how it works, and you have a working example:
http://moz.com/blog/create-crawlable-link-friendly-ajax-websites-using-pushstate
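As a rough sketch of how pushState could be wired up for the submenu in the question: the data-ajax attribute, the render callback, and both function names are assumptions made for illustration, not part of the linked article. The history and document objects are passed in so the logic can be exercised outside a browser.

```javascript
// Change the address bar (same origin only) without a page load,
// then let the caller-supplied render function swap in the content.
function navigate(historyApi, render, path) {
  historyApi.pushState({ path: path }, "", path);
  render(path);
}

// Browser wiring: intercept clicks on links carrying a hypothetical
// data-ajax attribute, and handle the back/forward buttons.
function attachAjaxNavigation(doc, win, render) {
  doc.addEventListener("click", function (event) {
    const link = event.target.closest("a[data-ajax]");
    if (!link) {
      return; // ordinary links keep their normal behavior
    }
    event.preventDefault();
    navigate(win.history, render, link.getAttribute("href"));
  });

  // popstate fires when the user goes back or forward;
  // re-render whatever path was stored with that history entry.
  win.addEventListener("popstate", function (event) {
    if (event.state && event.state.path) {
      render(event.state.path);
    }
  });
}
```

In a page this would be started with something like attachAjaxNavigation(document, window, loadContent), where loadContent is whatever function fetches and displays the content for a path. The server still has to answer a direct request for /about-us, otherwise bookmarks and crawlers get a 404.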

You can't change the URL in the address bar to point at a different site without loading a new page, because if that were possible I could let you visit me at http://www.imhackingyou.com/sucker but change the address bar to read http://www.bankofamerica.com/login

This is a routing issue, not an AJAX issue.
If you were using another tool (cough ASP.NET MVC cough), you'd just add a route (and I'm hopeful there's a way to do this in PHP) that accepted URLs like
/home
/products
...
and routed them to, say,
/index.php?area=home
/index.php?area=products
This is typically accomplished with a rewrite engine when used outside of a good MVC or RESTful URL system. I use ISAPI Rewrite on IIS, but if you're working on the LAMP stack, Apache provides a module with the same capabilities, mod_rewrite. (Google .htaccess)
WARNING: RANT FOLLOWS
And, for what it's worth,
Avoid trying to write your entire application in JavaScript. The server's there for a reason. Part of your job as a web developer is to absorb as much of the work onto your server as possible. Browser performance and compatibility issues will drive you mad when you try to do everything on the client.
Avoiding postbacks makes sense in a lot of circumstances, but it's not a silver bullet that you should try to apply to every page. Usually it makes sense to load a new page when a link is clicked. It's what the user expects, it's more stable (since most of the infrastructure required is server-side) and it's not slower than an AJAX request to retrieve the same thing.
Rules:
NEVER break the back button. Without careful planning, most AJAX apps break this rule.
See rule #1.

This sounds like it should be accomplished with a rewrite engine, but assuming that you have a good reason to use AJAX, you can change urls with javascript by modifying the portion after the hash, or better yet, the hashbang:
window.location.hash = "#!about-us";
http://mysite.com/
http://mysite.com/#!about-us
http://mysite.com/#!products
For more info on the hashbang from an SEO perspective, check out http://www.seomoz.org/blog/how-to-allow-google-to-crawl-ajax-content
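A minimal sketch of the hashbang approach, assuming a showSection callback that knows how to display a named section; that callback, the function names, and the "home" fallback are all hypothetical.

```javascript
// Turn a location hash into a section name:
// "#!about-us" -> "about-us"; anything else falls back to "home".
function routeFromHash(hash) {
  if (hash.startsWith("#!") && hash.length > 2) {
    return hash.slice(2);
  }
  return "home";
}

// Browser wiring: show the right section on load and whenever the
// part after the # changes (including back/forward navigation).
function watchHash(win, showSection) {
  function update() {
    showSection(routeFromHash(win.location.hash));
  }
  win.addEventListener("hashchange", update);
  update();
}
```

In a page, watchHash(window, showSection) would be called once, and links would simply point at #!about-us or #!products.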

How does Shopify do it then? Go to their website, click on the Features link and you'll see the URL says:
http://www.shopify.com/tour/sell-online
Then click on any of the sub links and you'll see that the address in the URL bar changes without using a hash, but there is no page flip.
I don't think they are using ajax to change the content because it all appears to be included in hidden divs on the page, but regardless, you can apparently change the URL using client side tricks.

Using PHP or JavaScript (client side) - How to inject a web element into a live webpage (for a demo..)?

Let's say, that I want to demonstrate a widget (or some HTML in a frame) that would be "injected" into another page.
For example: I want to show the people in Amazon.com that I can put let's say a ball image underneath every price tag they put on their web page. That is - I want to build a web server (or indeed a server less html web page) that would show their page and put some stuff of mine inside theirs. So it looks as if the client (Amazon.com here) has my software already installed on their server.
I am a web-dev total newbie, so if this is the simplest thing in the world please, ..
Thanks

There are TONS of special cases that can cause this to fail, but I'll present a simple way that will work for you on a decent number of webpages (but not all).
Save the webpage's HTML source into a local HTML file.
Edit the HTML source, adding a <base href="http://www.amazon.com/"> tag into the <head> element.
Make any other modifications to the page you want, such as adding new <script> tags to support your new functionality. Make sure your modifications use absolute URLs.
If they navigate away from the page, your enhancements will obviously not carry onto the next page. Also, you will have more success if you upload the file onto a web server. While a user can view the page by double-clicking on the HTML file if they were to save it locally, differences in JavaScript security permissions will likely make some webpages not function correctly.
The reason you need to add the <base> tag is because the browser resolves relative urls by looking at the url in its address bar. So, if the amazon page had an image like this
<img src="logo.png">
and you saved the HTML and put it on your web server at www.example.com, the browser would look for the image at www.example.com/logo.png, which clearly doesn't exist. The base tag tells it what base URL to use.
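The effect of the base URL on a relative link can be checked with the standard URL constructor, which resolves a relative reference against a base much as the browser does. The resolve helper is just a hypothetical wrapper for illustration.

```javascript
// Resolve a relative URL against a base URL, as a browser would
// for an <img src="logo.png"> on a page with that base.
function resolve(relative, base) {
  return new URL(relative, base).href;
}

// Without a <base> tag the page's own address is the base:
const withoutBase = resolve("logo.png", "http://www.example.com/");
// With <base href="http://www.amazon.com/"> the original site is the base:
const withBase = resolve("logo.png", "http://www.amazon.com/");

console.log(withoutBase); // http://www.example.com/logo.png
console.log(withBase);    // http://www.amazon.com/logo.png
```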
If you need more automation, having them install a browser addon would be a good way to do this if your users are somewhat technical. Greasemonkey is a popular addon, and you can tell it to inject stuff into certain webpages. The benefit of an addon is that it can inject the new functionality into any page on the web, without you having to individually save and modify them. Also, it has the potential to work on all web pages, leaving their functionality perfectly intact, as opposed to the other suggestion. This is far more complicated though.

create a php proxy page

I'm looking for a way to load a fully functional copy of a web site inside a PHP proxy page in order to be able to grab and change part of its elements and styles.
I decided to post this question to merge my previous two into a more relevant evolution:
live change any site visualization properties
load external site and change its visualization
I have found cURL functions useful to load the page (eg. www.google.it; for google.com I received a 302 redirection, but I won't face it now).
Some of the page elements, like the image logo, are not properly loaded; this should be due to the original relative path to the site resources. I have to manually add "//google.it" before them to fix, and it worked.
Now I have another issue:
How is it possible to go further in the site navigation?
When I click any link the page is reloaded with its "real" destination. I suppose I have to reload my php and use the href link attribute as url to load (I can do that).
But what about the submit buttons? How can I redirect their destination?

Use an existing proxy for that.
Generally you'll just have to find all the strings matching the old domain name and change them into your URL, so every link on the page will turn from www.bla.com/page.htm into proxy.com/page.htm.
This will also require some server setup because of possible Ajax requests and relative paths. Besides, it would be super hard to catch dynamically constructed URLs such as: var addr = 'b'+'la.com';
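For the static part of that job, a rough sketch could look like the following. It rewrites href and action attributes so they point back at the proxy. The proxify name, the work.php?url= endpoint, and the regex-based matching are assumptions for illustration; a regex will miss unquoted attributes and anything built by scripts, so a real proxy would want an HTML parser.

```javascript
// Rewrite href/action attribute values so they go through the proxy.
// originBase: the address of the page that was fetched (used to
// resolve relative links). proxyEndpoint: hypothetical proxy URL
// prefix that expects the encoded target appended to it.
function proxify(html, originBase, proxyEndpoint) {
  const attribute = /\b(href|action)=(["'])(.*?)\2/gi;
  return html.replace(attribute, function (whole, name, quote, value) {
    let absolute;
    try {
      absolute = new URL(value, originBase).href;
    } catch (error) {
      return whole; // leave values that cannot be resolved untouched
    }
    return name + "=" + quote + proxyEndpoint + encodeURIComponent(absolute) + quote;
  });
}
```

For example, proxify('<a href="/page.htm">x</a>', "http://www.bla.com/", "http://proxy.com/work.php?url=") points the link at the proxy with the original address as a parameter. Submit buttons follow the form's action attribute, so rewriting action covers POST forms. Forms submitted with GET replace the query string of the action with the form fields, so for those the target would have to travel some other way, for instance in a hidden field.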

What is the use of # in url

I realized that many web apps use # in their URLs.
For example, Google Analytics.
This address is in the URL bar when I am viewing the visitor's language page:
https://www.google.com/analytics/web/?hl=en#report/visitors-language/a33185827w60383872p61754588/
This address is in the address bar when I am viewing the visitors' geolocation page:
https://www.google.com/analytics/web/?hl=en#report/visitors-geo/a33185827w60383872p61754588/
I think that this is the Google Analytics web app passing #report/visitors-language and #report/visitors-geo.
I know that Google analytics is using an <iframe>. It seems that only the main content box is changing when displaying content.
Is # used because of the <iframe> functionality?

There are several answers but none cover the backend part.
Here is a URL, one from your own example:
www.google.com/analytics/web/?hl=en#report/visitors-language/a33185827w60383872p61754588/
You can think about the post-hash (including the hash #) part as a client-side request.
The web server will never know what was entered after the hash sign. It is the browser pointing to a specific ID on the page.
For basic web pages, if you have this HTML: <a name="main">welcome</a>
on a web page at www.example.com/welcome, going to www.example.com/welcome#main will scroll your browser viewport to the welcome text in the <a> HTML tag.
The web server will not know whether #main was in the URL or not.
Values in the URL after a question mark are called URL parameters, e.g. www.example.com/?foo=bar. The web server can deliver different content based on those values.
However, there is a technique called AJAX (Asynchronous JavaScript and XML) that lets a page fetch different content without a full page load, and web apps often use the # part of the URL to keep track of which content is shown. It's not using an <iframe>.
Using JavaScript, you can trigger a change in the URL's post-hash part and make a request to the server to get a specific part of the page, for example for the URL www.example.com/welcome#main2 Even if an element named #main2 does not exist, you can show one using JavaScript.
A hashbang is #!. It is used to make search engine indexing easier by indicating that this part is a dynamic web page.
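The split between what the server sees and what stays in the browser can be illustrated with the standard URL constructor. The splitUrl helper and its property names are hypothetical.

```javascript
// Separate the part of an address that is sent to the web server
// (path and query string) from the part the browser keeps (the hash).
function splitUrl(address) {
  const url = new URL(address);
  return {
    sentToServer: url.pathname + url.search,
    keptInBrowser: url.hash,
  };
}

const parts = splitUrl(
  "https://www.google.com/analytics/web/?hl=en#report/visitors-language/a33185827w60383872p61754588/"
);
console.log(parts.sentToServer);  // /analytics/web/?hl=en
console.log(parts.keptInBrowser); // #report/visitors-language/a33185827w60383872p61754588/
```

JavaScript running in the page can read the kept part through window.location.hash and then request whatever content it needs from the server.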

This is the "hash" in the URL.
Many browsers support the hashchange event in JavaScript.
As far as I know, the hash change was a revolution for Ajax callbacks.
When the user follows any link with a hash, the hashchange event is fired and you can respond to it in JavaScript.
One more thing: hash changes are recorded in the browser history.

See the question below:
SEO and the use of !# in a url
Or read this quote:
'#! is called a "hashbang" and they are the root of all that is evil in web development.'
Basically, weak web developers decided to use #anchor names as a kludgy hack to get "web 2.0" things to work on their page, then complained to google that their page rank suffered. Google made a work around to their kludge by enabling the hashbang.
Weak web developers took this work around as gospel. Don't use it. It is a crutch.
Web development that depends on hashbangs is web-development done wrong.
This article is far more well worded than I could ever be, and deals with the Gawker media fiasco from their migration to a (failed) hashbang centric website. It tells you WHAT is happening and why it's bad.
http://isolani.co.uk/blog/javascript/BreakingTheWebWithHashBangs

Correct me if I'm wrong, the hashtag in that URL would be used as an anchor to scroll the page to an element with an id. For example, I send you to the url http://example.com/sample#example, and the page would scroll (just display) at the element (I'm using a div as an arbitrary example, it could be anything).

Ajax and the hash mark in the URL are mostly used for quick actions.
If part of your site becomes visible only when an event fires (usually a click), it would be hard to share. With a hash mark in the URL, you can use JavaScript to make the browser behave as if the required action had happened, and it will display the relevant part.

Normally, a '#' in the URL makes the browser find the element whose id follows the '#' on that page. This lets us jump straight to content in the middle of the page.
