I'm in the process of creating a website where an HTTP reverse proxy is used to get around cross-domain issues. I'm thinking of using PHP cURL or Node.js. For example, http://my-proxyserver-example.com/www.yahoo.com would load www.yahoo.com into a DIV, and there should be no cross-domain issues in doing so, because all traffic goes via www.my-proxyserver-example.com.
However, if I do a search on www.yahoo.com and click one of the result links, I end up at a URL that does not contain my proxy server's address. Is there any way a URL request can be caught through some event handler so the proxy address can be inserted at the front of the URL?
I know that I could set up a proxy via the browser settings, but that involves users configuring their web browsers, and I'd rather not take that route. Could Node.js do this rewriting before the web page is received?
Any help will be appreciated.
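For what it's worth, the usual answer is not to catch clicks client-side but to rewrite the links server-side while the page passes through the proxy, so every href already carries the proxy prefix. A minimal PHP sketch; the file name proxy.php and the query-parameter form (instead of the path form in the question) are assumptions, and relative URLs, cookies, POSTs, and error handling are all ignored:

    <?php
    // proxy.php - minimal reverse-proxy sketch (illustrative only).
    $target = $_GET['url'];                      // e.g. http://www.yahoo.com/

    $ch = curl_init($target);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // Rewrite absolute links so every click goes back through this proxy.
    $prefix = 'http://my-proxyserver-example.com/proxy.php?url=';
    $html = preg_replace_callback(
        '/href="(https?:\/\/[^"]+)"/i',
        function ($m) use ($prefix) {
            return 'href="' . $prefix . urlencode($m[1]) . '"';
        },
        $html
    );

    echo $html;

A real version would also have to rewrite relative URLs, form actions, and script-generated requests, which is where this approach gets hard.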
Related
I have a PHP application in which I scrape a website and collect all of the links present on the site. While the scraper is running in one tab of a browser, if I open the app in another tab of the same browser, that tab keeps loading until the scraper in the first tab has finished.
I have tried using AJAX in this case, i.e. sending the request through an AJAX POST to find the links, but it has no effect.
Any kind of help and guidance will be appreciated.
That is probably caused by the session lock: PHP locks the session for the duration of a request, so if your multiple connections (tabs) require the same session, they can't run concurrently.
If they can be independent, you could pass a session ID in the URL to identify which tab is communicating with the server.
Note that the web server may also have restrictions configured on the number of simultaneous sessions from the same IP.
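If the scraper only needs to read its session data, it can also release the lock early with session_write_close(), which lets other requests from the same session proceed while the scrape keeps running. A minimal sketch (the session key and URL list are placeholders):

    <?php
    session_start();

    // Read whatever session data the scraper needs up front...
    $userId = isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null; // hypothetical key

    // ...then release the session lock so requests from other tabs
    // are no longer blocked while the scrape runs.
    session_write_close();

    // Long-running work. Note: $_SESSION can no longer be written to
    // after this point without calling session_start() again.
    $urls = array('http://example.com/'); // the pages to scrape
    foreach ($urls as $url) {
        $html = file_get_contents($url);
        // ... extract links, save to the database ...
    }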
I am working on a scraping project to extract web data from a website. I have made a script that goes through URLs, parses the HTML content, and stores the structured content in my database. The script was working fine, but recently it got stuck, and on investigation I found that the target site is blocking our IP.
I am using PHP/cURL for this project, and now I am getting a 403 Access Forbidden error on every web request.
This has affected the working of my script: no pages can be retrieved, and every request returns the access restriction error.
I know there is a lot of scraping etiquette to be followed. Since we can't foresee how they have implemented their security features, I am unsure how to normalize my web request calls.
I'm working on an Amazon AWS instance with an Elastic IP, so I am unsure when, or whether, they will lift the ban on my IP.
I have heard of rotating proxies being used with scraping, so that the target server won't block you as often, but I'm not sure about the implementation.
Any help would be highly appreciated. I can provide additional information if necessary.
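On the rotating-proxy point: with PHP/cURL this usually means choosing a different proxy (and often a different User-Agent) per request via CURLOPT_PROXY. A minimal sketch, assuming you have a list of working proxy endpoints; the addresses below are placeholders:

    <?php
    // Hypothetical proxy list - replace with real, working endpoints.
    $proxies = array(
        '203.0.113.10:3128',
        '203.0.113.11:8080',
        '203.0.113.12:3128',
    );
    $agents = array(
        'Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8) AppleWebKit/537.36',
    );

    function fetch($url, $proxies, $agents) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_PROXY, $proxies[array_rand($proxies)]);
        curl_setopt($ch, CURLOPT_USERAGENT, $agents[array_rand($agents)]);
        curl_setopt($ch, CURLOPT_TIMEOUT, 20);
        $body = curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        return ($code === 200) ? $body : false; // caller can retry with another proxy
    }

    $html = fetch('http://example.com/page', $proxies, $agents);
    sleep(rand(2, 5)); // be polite: pause between requests

Throttling and randomized delays matter at least as much as the proxies themselves; hammering the site from rotating IPs usually just gets more IPs banned.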
Sign up with a scraping API service to get an API ID.
If you send a request to the service with your API ID and the target URL, it will make the request to that URL from a random IP and return the response.
Just sign up and try it.
I want to extract some data from an HTML page.
I tried it with PHP, but ran into an issue: the page is only available if you are connected to a specific network. Unfortunately, my client is connected to that network, but my server is not, so the PHP requests fail.
My question is: if I scrape the page with JavaScript instead of PHP, will my request appear to come from my client's network?
No, it won't, unless you execute it in a browser that is already on your client's network! What you should perhaps check out is a proxy or a VPN: route your server's traffic through your client's network, and it will appear to be coming from their IP address.
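To make that concrete on the PHP side: if the client can expose a proxy reachable from your server (for example a local HTTP proxy plus a reverse SSH tunnel, which is an assumption here), cURL only needs CURLOPT_PROXY. A sketch; every address below is hypothetical:

    <?php
    // Assumes the client runs a proxy on their network and opens a
    // reverse tunnel to the server, e.g.:
    //   ssh -R 8888:localhost:3128 user@my-server.example
    $ch = curl_init('http://intranet.example/page.html');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8888'); // the tunnel endpoint
    $html = curl_exec($ch);
    curl_close($ch);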
I have a PHP proxy script which uses file_get_contents to fetch websites and output them.
Everything works as long as the websites are static, but on sites that use AJAX requests to update their content, like Twitter, 9GAG, or YouTube, the new content never gets added.
I get this error in the console:
XMLHttpRequest cannot load http://9gag.com/new/json?list=hot&id=6408098. Origin is not allowed by Access-Control-Allow-Origin.
Since the 9GAG site is now served as my local site through my local proxy, it can't access new content from the original 9GAG site; this is a cross-domain issue.
So my question is: how do I route those AJAX requests through my local proxy server?
This is a security feature, made to prevent exactly the kind of request you are trying to make. As I see it, you have only two possibilities:
Add the site to your hosts file so that it is forwarded to your proxy. In this case you have to make sure your proxy responds correctly under the original domain. I don't know whether the browser performs other checks besides the domain, but if only the domain is taken into account, everything will be OK.
Set the OS to use your proxy as a system proxy. In this case you have to make it respond like a regular proxy server.
P.S. Maybe it would be better to use some ready-made transparent proxy utility?
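A third option worth mentioning, not covered in the answer above: since the proxy already rewrites the page it serves, it can also rewrite the page's AJAX URLs to point at a pass-through endpoint on the same origin, so the browser never makes a cross-domain request at all. A minimal sketch of such an endpoint; the file name ajax-proxy.php and the whitelist are illustrative:

    <?php
    // ajax-proxy.php - relays AJAX calls for the rewritten page, served
    // from the same origin. Illustrative sketch, not hardened.
    $allowed = array('9gag.com', 'twitter.com'); // whitelist of target hosts
    $url  = isset($_GET['u']) ? $_GET['u'] : '';
    $host = parse_url($url, PHP_URL_HOST);

    if (!in_array($host, $allowed)) {
        header('HTTP/1.1 403 Forbidden');
        exit('host not allowed');
    }

    // Fetch the original AJAX resource and relay it unchanged.
    header('Content-Type: application/json');
    echo file_get_contents($url);

In the rewritten page, http://9gag.com/new/json?list=hot&id=6408098 would become something like /ajax-proxy.php?u=http%3A%2F%2F9gag.com%2Fnew%2Fjson%3Flist%3Dhot%26id%3D6408098, which is same-origin and therefore allowed.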
I need to fetch a URL with JavaScript/jQuery, not PHP.
I've read that you can do this with a PHP proxy, but then the request still goes through PHP, so it's still the server's IP that fetches the URL.
Could one fetch the URL entirely on the front end, and thus fetch it with the client's IP?
There is a same-origin policy for AJAX requests. This prevents JavaScript on, say, this site from making a request to gmail.com (with your cookies), reading your e-mails, and uploading them to the StackOverflow server. JavaScript on stackoverflow.com can only make AJAX requests to pages on that domain.
As you can see, this is essential for security. Requests must instead be made by a proxy running on your web server - PHP can be used, but there are other solutions. For example, Ajax Cross Domain is an AJAX library that communicates with a Perl script running on the server to emulate AJAX requests for other domains.
It is also possible to make requests to other domains via a JavaScript include (script tag), an image tag, etc., but in these cases you cannot read the contents of the page.
You cannot do this with an iframe either: scripts cannot see the internals of iframes unless they are on the same domain as the script.
So in short, use a proxy.
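If it helps, here is what such a proxy can look like in PHP, tying together the proxy idea and the script-tag technique mentioned above by returning the result as JSONP. The file name proxy.php and the parameter names are made up for illustration:

    <?php
    // proxy.php - fetches a remote URL server-side and returns it as JSONP.
    // Illustrative only: validate/whitelist $_GET['url'] in real use.
    $url      = isset($_GET['url']) ? $_GET['url'] : '';
    $callback = isset($_GET['callback']) ? $_GET['callback'] : 'callback';

    // Allow only simple identifier callbacks to avoid script injection.
    if (!preg_match('/^[a-zA-Z_][a-zA-Z0-9_]*$/', $callback)) {
        exit('bad callback');
    }

    $body = file_get_contents($url);

    header('Content-Type: application/javascript');
    echo $callback . '(' . json_encode(array('contents' => $body)) . ');';

From jQuery you could then call something like $.getJSON('/proxy.php?url=' + encodeURIComponent(url) + '&callback=?', handler). Note that the fetch itself still happens from your server's IP; there is no way to make it come from the client's IP without running in the client's browser.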
The problem is that jQuery would fetch the URL with AJAX, and AJAX won't operate cross-domain for security reasons (as per the same-origin policy).
There are, however, ways to emulate this: if you load the page in an iframe, you can retrieve the data by reading the iframe's innerHTML, though as noted above this only works when the iframe is on the same domain as your script. Here's an example script that uses jQuery: http://code.google.com/p/jquery-crossframe/