Scraping a web page: Javascript?

I want to extract some data from an HTML page.
I tried it with PHP, but ran into a problem: the page is only available from a specific network. My client is connected to that network but my server is not, so the PHP requests fail.
My question is: if I scrape the page with JavaScript instead of PHP, will my request appear to come from my client's network?

No, it won't, unless you execute it in a browser that is already on your client's network. What you should perhaps check out is a proxy or a VPN: route your server's traffic through your client's network, so that it appears to come from their IP address.
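As a rough sketch of the proxy route, PHP's HTTP stream context can send a request through a proxy. This assumes an HTTP proxy is actually running inside the client's network and is reachable from the server; the host name `proxy.client-network.example`, the port, and the function names are made up for illustration.

```php
<?php
// Hypothetical proxy located inside the client's network and reachable from the server.
const CLIENT_PROXY = 'tcp://proxy.client-network.example:3128';

/**
 * Build stream-context options that send HTTP requests through a proxy.
 */
function proxyContextOptions(string $proxy): array
{
    return [
        'http' => [
            'proxy'           => $proxy, // forward the request via the proxy
            'request_fulluri' => true,   // many HTTP proxies expect the full URI in the request line
            'timeout'         => 10,     // seconds
        ],
    ];
}

/**
 * Fetch a page through the proxy; returns the body, or false on failure.
 */
function fetchViaProxy(string $url, string $proxy = CLIENT_PROXY)
{
    $context = stream_context_create(proxyContextOptions($proxy));
    return file_get_contents($url, false, $context);
}
```

Calling `fetchViaProxy('http://intranet.example/page.html')` (again a placeholder URL) would then reach the target from the proxy's address rather than the server's.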

Related

How to change ip-user agent or proxy in php hosting?

Using a simple PHP cURL function for Facebook user-account control, I fetch the site and run my detection on the returned data.
But because I make many requests, Facebook blocks them and my PHP code stops working. How can I make each request look as if it came from a different computer, by changing the IP address (through a proxy, if one is available) and the user agent before running it?
Thank you.

If I understand correctly, your IP is blocked from fetching data through the API, so you are trying to fetch it from a different IP (a proxy). If so, try to find out why your IP was blocked and get it whitelisted by Facebook.
First, access canhazip.com or jsonip.com from the server to make sure it has the public IP you expect.
Second, make sure that IP address is in "Server IP Whitelist" for the app's Settings > Advanced section in the Developer console (https://developers.facebook.com/apps/[APP ID]/settings/advanced/).
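The first check can be scripted from the server itself. This is a minimal sketch that assumes canhazip.com answers with the caller's IP as plain text; the function names are invented for the example, and `allow_url_fopen` must be enabled for `file_get_contents` to fetch URLs.

```php
<?php
/**
 * Validate a plain-text response that should contain a single IP address.
 * Returns the IP, or null if the body is not a valid IPv4/IPv6 address.
 */
function parsePublicIp(string $body): ?string
{
    $ip = trim($body);
    return filter_var($ip, FILTER_VALIDATE_IP) !== false ? $ip : null;
}

/**
 * Ask an external service which public IP this server's requests come from.
 * Assumption: the service replies with just the IP address as text.
 */
function fetchPublicIp(string $service = 'https://canhazip.com'): ?string
{
    $body = @file_get_contents($service);
    return $body === false ? null : parsePublicIp($body);
}
```

Running `var_dump(fetchPublicIp());` on the server shows the address to compare against the app's whitelist.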

PHP open a running app in multiple tabs in same browser

I have a PHP application that scrapes a website and collects all of the links on the site. While the scraper is running in one browser tab, opening the app in another tab of the same browser keeps loading until the scraper in the first tab has finished.
I have tried sending the link-finding request through an AJAX POST instead, but it has no effect.
Any kind of help and guidance will be appreciated.

That is probably caused by the session lock. If your multiple connections (tabs) require the same session, they cannot be processed at the same time while one of them holds the session open.
If they could be independent, then you would have to pass a session id in the URL to identify which tab is communicating with the server.
Note that the web server may also have restrictions configured on the number of simultaneous sessions from the same IP.
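With PHP's default file-based session handler, the session stays locked from `session_start()` until the script ends or `session_write_close()` is called. One common way to shorten the lock, sketched here under the assumption that the scraper only needs to read the session, is to close the session before the long-running work starts. The function name, the `user_id` key, and the `$work` callable are hypothetical.

```php
<?php
/**
 * Run long work without holding the session lock.
 * $work is a hypothetical callable standing in for the scraper.
 */
function runWithoutSessionLock(callable $work)
{
    session_start();                        // acquires the lock on the session data
    $userId = $_SESSION['user_id'] ?? null; // read whatever the long job needs (hypothetical key)
    session_write_close();                  // save the session and release the lock

    // Other tabs using the same session are no longer blocked from here on.
    return $work($userId);
}
```

Any later write to `$_SESSION` inside `$work` would not be saved unless the session is started again.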

URL redirection event handler

I'm in the process of creating a website where an HTTP reverse proxy is used to get around cross-domain issues. I'm thinking of using PHP cURL or Node.js. For example, http://my-proxyserver-example.com/www.yahoo.com would load www.yahoo.com into a DIV, and there should be no cross-domain issues in doing so, because all traffic goes via www.my-proxyserver-example.com.
However, if I run a search query on www.yahoo.com, view the results and click one of the links, the resulting URL does not contain my proxy server address. Is there any way a URL request can be caught by some event handler so that the proxy address is inserted at the front of the URL?
I know that I could set up a proxy via the browser settings, but this requires users to configure their web browsers, and I'd rather not take this route. Could Node.js do this before the web page is received?
Any help will be appreciated.
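One possible direction, sketched here in PHP purely as an illustration, is to rewrite the links on the proxy before the page is sent to the browser, so that no client-side event handler is needed. The function names are invented; the regex only handles absolute links in double-quoted `href` attributes, drops the original scheme, and is not a substitute for a real HTML parser.

```php
<?php
// Proxy prefix taken from the example URL in the question.
const PROXY_BASE = 'http://my-proxyserver-example.com/';

/** Turn an absolute http(s) URL into its proxied form. */
function toProxyUrl(string $url, string $proxyBase = PROXY_BASE): string
{
    // Strip the scheme so http://www.yahoo.com/x becomes <proxy>/www.yahoo.com/x
    return $proxyBase . preg_replace('#^https?://#i', '', $url);
}

/** Rewrite absolute links in double-quoted href attributes (rough sketch only). */
function rewriteLinks(string $html, string $proxyBase = PROXY_BASE): string
{
    return preg_replace_callback(
        '#href="(https?://[^"]+)"#i',
        function (array $m) use ($proxyBase): string {
            return 'href="' . toProxyUrl($m[1], $proxyBase) . '"';
        },
        $html
    );
}
```

Relative links, forms, and URLs built by scripts on the page would still need separate handling.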

How to ensure the HTTP_REQUEST Is coming from the right place?

I have learned that HTTP_REFERER, like any HTTP request header, can be faked and is not reliable.
REMOTE_ADDR is reliable, though.
So how can I ensure that an incoming HTTP request comes from a website that I have whitelisted?
For example, I have JS code that sends requests from a client site to my server (something like a cross-platform snippet). However, I only want to allow this from several websites, not others, so that even if other people copy the code and put it on their own website, it won't work.

In the general case you simply can't do it. You are entirely at the mercy of the client. Checking the referrer makes abuse more difficult, but not impossible.

The only way to do this reliably is to have all of those websites generate unique tokens for every user, similar to how you protect yourself from CSRF attacks. The tokens would then be sent along with the request by your script, and your server would need a way to check each token for authenticity against the other websites. Needless to say, this is very likely impossible unless you control all of the sites.
See also this question on HTTP_REFERER

I haven't used this in practice, so there might be practicality issues I wasn't counting on, but I thought I'd contribute the idea anyway. If I interpret it correctly, this is similar to (if not the same as) the idea @Seldaek posted.
1. Your Server generates a unique ID for each page-serve and embeds the ID in the page.
2. Server stores the ID and the Client's IP address.
3. The js on the client places the ID in its request to the Server and sends the request.
4. When the Server receives the js request from the Client, it only responds if the IP/ID pair matches one that is on-file (see #2).
5. After some specified time (and/or when the browser session ends), the ID/IP entries expire.
This could perhaps be faked if a person sharing the visitor's IP address (perhaps both are behind the same NAT box) hijacks another visitor's session in real-time, but it will at least prevent someone from making another web page which piggybacks on your server's service.
There could also be issues if, for some reason, your visitor's IP address changes between when the page was served and when the js request was sent.
Basically, your server is saying "I will not service your js request unless you possess the data from a page I recently served and you are coming from (to the best of my knowledge) the place to which I served that page."
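The ID/IP pairing described in this answer could look roughly like the following. It is only a sketch: the store is a plain in-memory array standing in for a database or cache, the ten-minute lifetime is an arbitrary assumption, and the function names are invented.

```php
<?php
const TOKEN_TTL = 600; // lifetime of an ID/IP entry in seconds (arbitrary assumption)

/** Generate an ID when serving a page and remember which IP it was served to. */
function issuePageId(array &$store, string $clientIp, int $now): string
{
    $id = bin2hex(random_bytes(16)); // unpredictable ID to embed in the page
    $store[$id] = ['ip' => $clientIp, 'expires' => $now + TOKEN_TTL];
    return $id;
}

/** Answer the js request only if the ID is on file, not expired, and the IP matches. */
function isRequestAllowed(array &$store, string $id, string $clientIp, int $now): bool
{
    if (!isset($store[$id])) {
        return false;                // unknown ID
    }
    if ($store[$id]['expires'] < $now) {
        unset($store[$id]);          // drop the expired entry
        return false;
    }
    return hash_equals($store[$id]['ip'], $clientIp);
}
```

In a real page-serve, `$clientIp` would come from `$_SERVER['REMOTE_ADDR']` and `$now` from `time()`; they are parameters here so the behaviour is easy to check.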

All http headers can be faked.
If you are just accepting communication from the remote server (and not having a client browser be redirected to your server) then you can either set up a VPN between that remote server and yours or you can change your firewall config to only allow communication from a specific set of IP addresses. However, even the latter can be faked by people willing to go that far.
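For the server-to-server case, the same allow-list idea can also be approximated inside the application rather than in the firewall. This is a sketch only; the addresses are documentation placeholders and the function name is made up.

```php
<?php
// Placeholder addresses of the remote servers that may call this endpoint.
const ALLOWED_SERVER_IPS = ['192.0.2.10', '192.0.2.11'];

/** Check the connecting address against the allow-list. */
function isAllowedServer(string $remoteAddr, array $allowed = ALLOWED_SERVER_IPS): bool
{
    return in_array($remoteAddr, $allowed, true);
}
```

At the top of the endpoint this might be used as `if (!isAllowedServer($_SERVER['REMOTE_ADDR'] ?? '')) { http_response_code(403); exit; }`.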
If the client browser is the one either being redirected to your server or loading the file(s) from your server then there is absolutely nothing you can do.

As @Billy says, this simply isn't possible; you're thinking about the internet's request/response mechanism incorrectly.
"For example, I have a js code that will send from client site to server. (something like a snippet, cross platform)."
I assume what you're saying is that you have some JavaScript code served up on some website on your 'whitelist' which redirects the user to your website. It's on your website that you want to check that the user came from the 'whitelisted' site?
Aside from setting a cookie (might not be possible - cross domains) you might find it tough. Have you taken a look at OpenID? If you can post more details a solution may be more obvious.

"so, how can I ensure the incoming HTTP_REQUEST call is coming from a website that I white-list?"
I think you could sign every request (from the whitelist) with a signature that is valid for that request only (once). I assume using uniqid for this is safe (enough?).
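A sketch of what single-use signing might look like, assuming the whitelisted site and your server share a secret that never reaches the browser (so the signature has to be produced server-side on the whitelisted site). Note that the PHP manual warns that `uniqid()` does not produce cryptographically secure values, so a nonce from `random_bytes()` is the safer choice. The function names and the in-memory nonce list are invented for the example.

```php
<?php
/** Sign a payload together with a one-time nonce using a shared secret. */
function signRequest(string $payload, string $nonce, string $secret): string
{
    return hash_hmac('sha256', $nonce . '|' . $payload, $secret);
}

/** Accept a request only if the signature is valid and the nonce is unused. */
function verifyOnce(
    array &$usedNonces,
    string $payload,
    string $nonce,
    string $signature,
    string $secret
): bool {
    if (isset($usedNonces[$nonce])) {
        return false;                // replayed request
    }
    if (!hash_equals(signRequest($payload, $nonce, $secret), $signature)) {
        return false;                // wrong or forged signature
    }
    $usedNonces[$nonce] = true;      // mark the nonce as spent
    return true;
}
```

A nonce could be generated with `bin2hex(random_bytes(16))`; in practice the used-nonce list would live in shared storage and be expired after a while.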

php cURL or file_get_content affect on google analytics

I'm wondering what effect loading an external page with PHP has on a site's analytics. If PHP is loading an external page, rather than an actual browser, will the JavaScript that reports back to Google Analytics register the page load as a hit?

Any JavaScript within the fetched page will not be run and therefore have no effect on analytics. The reason for this is that the fetched HTML page is never parsed in an actual browser, therefore, no JavaScript is executed.

Curl will not automatically download JavaScript files the HTML refers to. So unless you explicitly download the Google Analytics JavaScript file, Google won't detect the Curl hit.

Google offers a non-JavaScript method of tracking hits. It's intended for mobile sites, but may be repurposable for your needs.

You're misunderstanding how curl/file_get_contents work. They're executed on the server, not on the client browser. As far as Google and any regular user is concerned, they'll see the output of those calls, not the calls themselves.
e.g.
client requests page from server A
server A requests page from server B
server B replies with page data to server A
server A accepts page data from server B
server A sends page data to client
Assuming that all the requests work properly and don't issue any warnings/errors, and there are no network glitches between server A and server B, there is absolutely no way for the client to see exactly what server A is doing. It could be sending a local file. It could be executing a local script and sending its output. It could be offshoring the request to a server in India which does the hard work and then simply claiming the credit for it, etc.
Now, you CAN get the client to talk to server B directly. You could have server A spit out an HTML page that contains an iframe, image tag, script tag, css file, etc... that points to server B. But that's no longer transparent to the client - you're explicitly telling the client "hey, go over there for this content".
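The server-side relay in the steps above (server A fetching from server B and passing the result on) can be sketched as follows. The function name is invented, and the fetcher is injectable only so the behaviour can be shown without a network; by default it is `file_get_contents`.

```php
<?php
/**
 * Server A's part of the exchange: fetch a page from server B and return its body.
 * Any <script> tags in the body are just text here; PHP never executes them.
 */
function relayPage(string $url, ?callable $fetch = null): string
{
    $fetch = $fetch ?? 'file_get_contents'; // default: a real HTTP request made by server A
    $body  = $fetch($url);

    return $body === false ? '' : $body;    // empty string if server B could not be reached
}
```

Server A would then `echo relayPage('http://server-b.example/page');` (a placeholder URL), and only when the client's browser parses that output would any analytics script in it run, in the context of the page server A served.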
