I am working on a scraping project to extract data from a website. I wrote a script that walks through URLs, parses the HTML, and stores the structured content in my database. The script was working fine, but recently it got stuck, and on investigation I found that the target site is blocking our IP.
I am using PHP/cURL for this project, and I am now getting a 403 Forbidden error on every web request. This has broken the script: no pages can be retrieved, and every request returns an access-restriction error.
I know there is a lot of scraping etiquette to follow. Since I can't foresee how they have implemented their security features, I am unsure how to normalize my request behavior.
I'm working on an Amazon AWS instance with an Elastic IP, so I don't know when, or whether, they will lift the ban on my IP.
I have heard of rotating-proxy methods used with scraping, so that the target server won't block you as often, but I'm not sure about their implementation.
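From what I have read so far, a basic rotation with PHP cURL might look something like the sketch below; the proxy addresses are placeholders (I don't have a working pool yet), and the User-Agent and delay are just etiquette measures I have seen recommended:

```php
<?php
// Hypothetical proxy pool -- replace with real proxy addresses.
$proxies = array(
    '203.0.113.10:8080',
    '203.0.113.11:8080',
    '203.0.113.12:3128',
);

function fetch_via_random_proxy($url, array $proxies) {
    $proxy = $proxies[array_rand($proxies)]; // pick a proxy at random per request

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MyScraper/1.0)');
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    $html = curl_exec($ch);
    curl_close($ch);

    sleep(rand(2, 5)); // throttle between requests to stay polite
    return $html;
}
```

Is that roughly the right approach?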
Any help would be highly appreciated. I can provide additional information if necessary.
Sign up on a scraping-API service to get an API key. If you send a request to the service with your API key and the target URL, it will forward the request to that URL through a random proxy and return the response. Just sign up and try it.
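For illustration, the call from PHP would look roughly like this; the endpoint and parameter names here are hypothetical, so check the service's documentation for the real ones:

```php
<?php
// Hypothetical scraping-API endpoint and key -- consult your service's docs.
$apiKey    = 'YOUR_API_KEY';
$targetUrl = 'http://example.com/page-to-scrape';

$ch = curl_init('https://api.example-scraper.com/get?api_key='
    . urlencode($apiKey) . '&url=' . urlencode($targetUrl));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch); // the service fetches the URL through a random proxy
curl_close($ch);

echo $response;
```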
I have developed a REST API in core PHP. The API is used by a mobile app to fetch data from the server.
Now we have a situation where 10,000 users are using the app at the same time. When all of these users are on the app simultaneously, the server (Amazon EC2, Ubuntu 12.04) fails.
To solve this, we have decided to use CloudFlare. After a lot of research, it is still not clear how to use CloudFlare to cache the responses coming from the API.
Below are few links that I have followed so far:
https://support.cloudflare.com/hc/en-us/articles/202775670-How-Do-I-Tell-Cloudflare-What-to-Cache-
https://blog.cloudflare.com/introducing-pagerules-advanced-caching/
We have already set the CNAME and HOST details in our CloudFlare account. Can someone help me understand what implementation is needed, or whether this is possible at all?
After a lot of research, I found that it is possible to cache REST API responses. You just need to create a custom page rule.
In case someone else is facing the same issue, follow the steps below:
1) Get a domain name for your URL. For example, if your API URL is http://xx.xx.xx.xx, get a domain and point it at the server so that your API URL becomes http://domainname/...
2) Since the data is not HTML or CSS content, you need to create a custom "Cache Everything" page rule. This is documented very nicely by CloudFlare, but the link is hard to find, so here it is: https://support.cloudflare.com/hc/en-us/articles/115000150272-How-do-I-use-Cache-Everything-with-Cloudflare-
The entire setup with CloudFlare is done and the performance of my server has improved drastically. Just follow the steps carefully!
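One detail worth adding: with a Cache Everything rule, CloudFlare generally respects the caching headers your origin sends (unless you override them with an Edge Cache TTL in the page rule), so it can help to have the API emit an explicit Cache-Control header. A minimal sketch in PHP, assuming a five-minute TTL is acceptable for your data:

```php
<?php
// Tell CloudFlare (and browsers) this response may be cached for 5 minutes.
header('Cache-Control: public, max-age=300');
header('Content-Type: application/json');

$payload = array('status' => 'ok', 'items' => array(1, 2, 3)); // example data
echo json_encode($payload);
```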
As per the title, I'm trying to trigger an Azure Website "triggered" WebJob from our custom PHP deployment application, hosted externally to Azure Websites.
Thanks to what I believe is Active Directory, I can navigate the /api URLs in my browser and get JSON output without having to re-authenticate. For example, /api/triggeredwebjobs outputs the triggered WebJob information (that I've set up inside the Azure Portal) in my browser.
I've gotten as far in my PHP app as sending a POST request, and it successfully authenticates using basic auth, but every single /api URL that I set in my PHP app returns:
"No route registered for '/api/triggeredwebjobs/{webjobname}'"
where {webjobname} is my custom name for the WebJob, hidden for the client's privacy. Every URL returns this, but if I navigate in my browser, I only get that error for URLs that don't exist, such as /api/blahblahblah.
I've set up a deployment user, which is what it's using to authenticate. I've even logged in to https://{azuresite}.scm.azurewebsites.net/basicauth using the deployment user and successfully gotten output from each /api page in my browser.
If it helps, I'm using Httpful.phar to handle the HTTP requests.
Thank you very much for taking the time to read and possibly assist.
A colleague helped me get to the bottom of this: the documentation was out of date. I have opened an issue on the Kudu GitHub repo asking them to review it:
https://github.com/projectkudu/kudu/issues/1466
To solve the issue for future readers of this question, the correct URL to use within the requester app is:
https://{yoursite}.scm.azurewebsites.net/jobs/triggered/{jobname}/run
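For example, with plain cURL (Httpful should work the same way); the credentials below are placeholders, and {yoursite}/{jobname} are as above:

```php
<?php
// Placeholders: substitute your site, job name, and deployment credentials.
$url = 'https://{yoursite}.scm.azurewebsites.net/jobs/triggered/{jobname}/run';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, '');          // empty body; hitting the URL triggers the job
curl_setopt($ch, CURLOPT_USERPWD, 'deployUser:deployPassword'); // basic auth
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);   // a 2xx status means the job was accepted
curl_close($ch);
```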
Good luck!
I'm working on a simple geolocation tracker based on this project and OsmAnd. OsmAnd uses URL parameters to send geolocation data to the web service, which then writes it to a file. The URL entered into OsmAnd looks like
http://example.com/tracker.php?key=j2R1nrQ&lat={0}&lon={1}&timestamp={2}&hdop={3}&altitude={4}&speed={5},
where each {#} is replaced with the location data by OsmAnd.
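For reference, tracker.php essentially just reads those parameters and appends them to a file; a simplified sketch of that logic (the key check matches the URL above):

```php
<?php
// Simplified sketch of tracker.php: check the key, then append the fix to a file.
if (!isset($_GET['key']) || $_GET['key'] !== 'j2R1nrQ') {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

$fields = array('timestamp', 'lat', 'lon', 'hdop', 'altitude', 'speed');
$values = array();
foreach ($fields as $f) {
    $values[] = isset($_GET[$f]) ? $_GET[$f] : '';
}

file_put_contents('locations.csv', implode(',', $values) . "\n", FILE_APPEND | LOCK_EX);
```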
I have confirmed that OsmAnd is pinging my site with correctly formed data. If I open the link in a browser, the data is correctly written to the file, but when the app on my phone pings the page, it is not. PHP runs on the server, right? So why would it make a difference whether the site is pinged by the app on my phone or by my browser?
I figured it out. The app was generating a 406 error on the server, which was caused by Apache's mod_security. Disabling mod_security solved the issue.
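If your host allows per-directory overrides, disabling it only for the tracker's directory is less drastic than turning it off site-wide. With mod_security 1.x this can be done in an .htaccess file (the 2.x branch uses different directives and usually has to be configured in the vhost); a sketch:

```apacheconf
# .htaccess next to tracker.php -- mod_security 1.x syntax.
# Disabling the filter engine here affects only this directory.
<IfModule mod_security.c>
    SecFilterEngine Off
    SecFilterScanPOST Off
</IfModule>
```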
I'm in the process of creating a website where an HTTP reverse proxy is used to get around cross-domain issues. I'm thinking of using PHP cURL or Node.js. For example, http://my-proxyserver-example.com/www.yahoo.com would load www.yahoo.com into a DIV, and there should be no cross-domain issues in doing so, because all traffic goes via www.my-proxyserver-example.com.
However, if I do a search query on www.yahoo.com, see the results, and click one of the links, I end up at a URL that does not contain my proxy server's address. Is there any way a URL request can be caught through some event handler so that the proxy address can be inserted at the front of the URL?
I know that I could set up a proxy via the browser settings, but that involves users reconfiguring their web browsers, and I'd rather not take that route. Could Node.js do this rewriting before the web page is delivered?
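To illustrate what I mean, the fetch-and-rewrite step on the proxy might look roughly like this in PHP (proxy.php and the rewriting regex are just a sketch; a real version would need URL validation and handling of relative links):

```php
<?php
// Rough sketch of the proxy endpoint, requested as e.g.
// http://my-proxyserver-example.com/proxy.php?target=www.yahoo.com
// WARNING: validate $_GET['target'] in practice, or this becomes an open proxy.
$target = 'http://' . $_GET['target'];

$ch = curl_init($target);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Naive link rewriting: prefix absolute links so clicks route back through the proxy.
$html = preg_replace(
    '#href="https?://([^"]+)"#i',
    'href="http://my-proxyserver-example.com/proxy.php?target=$1"',
    $html
);

echo $html;
```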
Any help will be appreciated.
I am currently in the process of creating a mobile version of my web app.
The app is being developed with Facebook's PHP Client Library.
The issue:
I am using the following mobile URL to allow users to log in from their mobile devices:
http://m.facebook.com/tos.php?api_key=APIKEY&v=1.0&next=http%3A%2F%2Ftweelay.net%2Fm.php&cancel=http%3A%2F%2Ftweelay.net%2Fm.php
APIKEY being my app's actual Facebook API key.
In the URL I am telling Facebook to redirect the user back to http://tweelay.net/m.php when the user signs in or clicks Cancel on the log-in screen. I am pulling my hair out trying to figure out why it keeps sending the user to http://m.tweelay.net/m.php, which is currently an invalid endpoint.
I have gone through all of my app's settings on Facebook and I can't find any that reference http://m.tweelay.net, and going through all of my source code I can't find anything that references the m. subdomain either.
Any ideas? Is there a setting I'm missing? Maybe a flag in the library?
I've seen Facebook do this when it detects the mobile browser type, and also sometimes seemingly at random in Firefox (it can even happen when trying to get to facebook.com itself). I've managed to reset it sometimes, but it's not a guaranteed fix.
If you want to be sure the user makes it to your correct site, I suggest creating the subdomain and redirecting its traffic to your usual site. That's what I did, and now I don't worry about it reverting back.
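The redirect itself can be a one-liner. A sketch in PHP, assuming the m. subdomain routes requests to a script like this (REQUEST_URI already includes the path and query string):

```php
<?php
// On the m. subdomain: permanently forward every request to the main domain.
header('Location: http://tweelay.net' . $_SERVER['REQUEST_URI'], true, 301);
exit;
```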