How does Copyscape use the Google API?
The AJAX API works only in browsers with JavaScript enabled, so that API is not used. The SOAP API is not used either, because it may not be used commercially and it is limited to 100 queries per day.
Copyscape does not use a Google API; it uses Google search directly. It makes a simple cURL request to http://www.google.com/search?q=Search Keywords here, then uses regexp patterns to extract the titles, descriptions, and links and shows them to the user. This strictly violates Google's Terms of Service, which could get them banned, so they use proxies (or some other IP-hiding method) to mask their IP for each search.
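For illustration, here is roughly what that kind of request looks like in PHP. This is a sketch only: the query, regex pattern, and user agent are guesses at the markup and will break whenever Google changes it, and doing this at all violates Google's Terms of Service.

    <?php
    // Illustrative sketch only: fetch a Google results page with cURL and
    // pull titles/links out with a regex, as described above.
    $query = urlencode('"some phrase to check"');
    $ch = curl_init('http://www.google.com/search?q=' . $query);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // pose as a browser
    $html = curl_exec($ch);
    curl_close($ch);

    // Crude pattern for result links; real result markup changes often,
    // so anything like this is fragile.
    preg_match_all('/<a href="(https?:\/\/[^"]+)"[^>]*>(.*?)<\/a>/is', $html, $m);
    foreach ($m[1] as $i => $url) {
        echo strip_tags($m[2][$i]) . ' => ' . $url . "\n";
    }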
Their FAQ explains how they do it:
Where does Copyscape get its results?
Copyscape uses Google and Yahoo! as search providers, under agreed terms. These search providers send standard search results to Copyscape, without any post-processing. Copyscape uses complex proprietary algorithms to modify these search results in order to provide a plagiarism checking service. Any charges are for Copyscape's value-added services, not for the provision of search results by the search providers.
http://www.copyscape.com/faqs.php#providers
Analysis
Copyscape's FAQ makes it 100% clear that Google and Yahoo have special agreements with them. I am about 80% sure that Copyscape is using a search solution similar to Google Enterprise Search (probably undisclosed, but comparable), provided by the search engines themselves.
Copyscape is not scraping results; it is fetching API-based formats like JSON and XML, which is also good for the providers (Google and Yahoo) in terms of bandwidth and response time. I base this on my earlier attempts to scrape Google search results with Python using phrase searches ("phrase matching"): there is no known way for a scraping bot to get past the 503 responses Google starts returning after a couple of hundred results (at intervals of 50 or 100 searches).
They obviously are not doing browser automation either, shuttling data between web drivers and a language like Python. I have tried that, and it gave similar results, except that the automated searcher needs manual intervention to solve the CAPTCHA before scraping can continue. I also tried one of the latest bypasses, which was patched within minutes. They are surely not doing automated scraping of the search engines, and even if they were, it would not work long term.
How are they using their special privilege?
Since they have paid for, or negotiated, special terms, they can automate against the special APIs. They are either using Google Search Enterprise and Yahoo Search Marketing Enterprise, or they have an even more specialized solution.
What they are not using
Regular / free APIs (not sure whether Google and Yahoo made these free for them)
Scrapers (Scrapy, Beautiful Soup, Selenium, etc.)
What they are using
An enterprise-level API
Server-side Bash / Python / Ruby / PHP scripts for scalability and the like.
Hoping
I hope someone from Copyscape leaks some details so that people won't have to keep guessing, and Copyscape gets more competition: there are only a handful of plagiarism checkers out there that are highly reliable and well regarded (probably just 1-10).
Related
I'm trying to make a PHP-based location search. I want it to be as 'smart' as possible, able to find addresses as well as hotels, museums, etc.
I am currently using the Google Geocoding API, but the problem is that it only seems to find addresses (when I input a hotel name it finds either nothing or some location on the other side of the planet).
I looked further and found the Places API, which can find all kinds of businesses and other locations. The problem is, I don't think it can find normal addresses (though correct me if I'm wrong).
So ideally I would be able to search for addresses AND other places at the same time, and receive either a list of results sorted by relevance (as determined by Google), or only the single most relevant result.
Thanks in advance!
Wouter Florign,
Your current problem has a few components:
(1) the request/response from the Google Geocoding API
(2) the request/response from the Google Places API
(3) your workflow to process the responses/data from the above two API calls.
The main objective of your code is to maintain consistency between related and dependent objects without sacrificing code reusability (the continuation of your workflow depends on your API responses). To ensure this, you can use the Observer pattern to wait for your requests to complete before continuing your workflow. The reason for using the Observer pattern rather than promises is that PHP is almost completely single-threaded; because of this, an implementation based on a promise will block your script until it completes.
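A minimal sketch of that idea, using PHP's built-in SplSubject/SplObserver interfaces. The class names, the example geocoding URL, and the merging step are hypothetical, not part of any Google SDK:

    <?php
    // Sketch: an API request that notifies observers when its response is in.
    class ApiRequest implements SplSubject
    {
        private $observers = [];
        public $response;

        public function attach(SplObserver $observer): void
        {
            $this->observers[] = $observer;
        }

        public function detach(SplObserver $observer): void
        {
            // Omitted for brevity.
        }

        public function notify(): void
        {
            foreach ($this->observers as $observer) {
                $observer->update($this);
            }
        }

        public function fetch(string $url): void
        {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            $this->response = curl_exec($ch);
            curl_close($ch);
            $this->notify(); // continue the workflow once the response is in
        }
    }

    class NextWorkflowStep implements SplObserver
    {
        public function update(SplSubject $subject): void
        {
            // $subject is the completed ApiRequest; decode and continue.
            $data = json_decode($subject->response, true);
            // ... merge geocoding and places results here ...
        }
    }

    // Usage: attach the next step, then fire the request.
    $geocode = new ApiRequest();
    $geocode->attach(new NextWorkflowStep());
    $geocode->fetch('https://maps.googleapis.com/maps/api/geocode/json?address=Amsterdam&key=YOUR_KEY');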
If you feel more comfortable using promises, you can have your promise fork from the main script (using the PCNTL family of functions). This allows the promise code to run in the background while the main script continues. It makes active use of pcntl_fork, which lets you fork a new process. When the promise completes, it comes back. This has drawbacks, the biggest being that the child process can only message the main process via signals.
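A bare-bones sketch of the forking approach; it assumes the pcntl extension on a POSIX system (CLI scripts, not mod_php), and the API URL is a placeholder:

    <?php
    $pid = pcntl_fork();
    if ($pid === -1) {
        die('Could not fork');
    } elseif ($pid === 0) {
        // Child: run the slow API call without blocking the parent.
        $result = file_get_contents('https://example.com/slow-api'); // placeholder URL
        // Reporting back is the hard part: signals, files, or sockets only.
        exit(0);
    } else {
        // Parent: carry on immediately; reap the child without blocking.
        pcntl_waitpid($pid, $status, WNOHANG);
    }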
Another caveat:
I implemented something very similar to this a couple of years ago, and I believe I ran into the same problem.
In my case, I was able to leverage the Yelp API, which is really fantastic. All you have to do is perform a GET request against the Search API with the optional longitude and latitude parameters (there is also a radius parameter to limit your search). With this, I was able to get all kinds of information about businesses at a given location (hotels, restaurants, professional services such as doctors, dentists, and physical therapists), and I was able to sort it by various metrics (satisfaction, relevance, etc.).
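For illustration, a search by coordinates might look like the sketch below. Note this uses the current Yelp Fusion endpoint and parameter names, which may differ from the API version available when I did this; the key and coordinates are placeholders.

    <?php
    $params = http_build_query([
        'latitude'  => 52.370216,  // example: Amsterdam
        'longitude' => 4.895168,
        'radius'    => 1000,       // metres
        'sort_by'   => 'best_match',
    ]);
    $ch = curl_init('https://api.yelp.com/v3/businesses/search?' . $params);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['Authorization: Bearer YOUR_API_KEY']);
    $result = json_decode(curl_exec($ch), true);
    curl_close($ch);

    foreach ($result['businesses'] ?? [] as $biz) {
        echo $biz['name'] . ' (' . $biz['rating'] . " stars)\n";
    }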
Please let me know if you have any questions!
Any ideas? I am new to PHP and am having a lot of trouble with cURL and DOMDocument, so please write or show me an example. I was thinking of using DOMDocument, but I cannot figure out how to get Amazon to search a user's input from my site and display selected parts of the results, such as price, category, etc.
There are several methods, using file_get_contents, a "simple HTML DOM" parser (https://simplehtmldom.sourceforge.io/), or cURL, all of which I've had varying luck with, but eventually Amazon starts flagging my requests with robot checks. I originally used the API, but Amazon locked that down with minimum-traffic rules that my budding web service can't meet, rendering it useless.
There is currently no easy or effective way to consistently pull Amazon data, though I'm experimenting with randomizing user agents and using proxies, as sketched below.
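The rotation I'm playing with looks roughly like this; the proxy address is a placeholder, and none of this makes scraping reliable or compliant with Amazon's terms:

    <?php
    $userAgents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
        'Mozilla/5.0 (X11; Linux x86_64)',
    ];
    $ch = curl_init('https://www.amazon.com/s?k=' . urlencode('php books'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Pick a random user agent and route through a proxy for each request.
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgents[array_rand($userAgents)]);
    curl_setopt($ch, CURLOPT_PROXY, '203.0.113.10:8080'); // placeholder proxy
    $html = curl_exec($ch);
    curl_close($ch);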
Use the Product Advertising API instead of scraping: http://docs.aws.amazon.com/AWSECommerceService/latest/DG/ItemSearch.html
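A sketch of a signed ItemSearch call based on the linked documentation; the credentials and keywords are placeholders, and the signing steps follow the API's standard REST signature scheme:

    <?php
    $accessKey    = 'YOUR_ACCESS_KEY';
    $secretKey    = 'YOUR_SECRET_KEY';
    $associateTag = 'YOUR_ASSOCIATE_TAG';

    $params = [
        'Service'        => 'AWSECommerceService',
        'Operation'      => 'ItemSearch',
        'AWSAccessKeyId' => $accessKey,
        'AssociateTag'   => $associateTag,
        'SearchIndex'    => 'All',
        'Keywords'       => 'php books',
        'ResponseGroup'  => 'ItemAttributes,Offers',
        'Timestamp'      => gmdate('Y-m-d\TH:i:s\Z'),
    ];

    // Canonicalize: sort by key, URL-encode, then sign with HMAC-SHA256.
    ksort($params);
    $pairs = [];
    foreach ($params as $k => $v) {
        $pairs[] = rawurlencode($k) . '=' . rawurlencode($v);
    }
    $query        = implode('&', $pairs);
    $stringToSign = "GET\nwebservices.amazon.com\n/onca/xml\n" . $query;
    $signature    = rawurlencode(base64_encode(
        hash_hmac('sha256', $stringToSign, $secretKey, true)
    ));

    $url = "http://webservices.amazon.com/onca/xml?{$query}&Signature={$signature}";
    $xml = simplexml_load_string(file_get_contents($url));
    foreach ($xml->Items->Item as $item) {
        echo $item->ItemAttributes->Title . "\n";
    }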
The Product Advertising API would indeed be the best resource for this, although it gives you limited results, and I believe that if no affiliate transaction occurs within 180 days they may revoke your access, so it does constrain you to some extent depending on your use. I'm not 100% sure, but my understanding is that you may also need a professional seller account or an affiliate membership.
Does anyone know of a piece of code that can run on a server and pipe data from Apache logs into Google Analytics? I've got a bunch of websites that generate logs, but their users would likely object to having Google tracking codes injected into the pages. This might be a nice way to get the basics (what's being requested, and from where) and have it all sorted for me alongside my other Google Analytics pages.
You can use the new Measurement Protocol (available for Universal Analytics accounts only) to implement a server-side solution.
Piping logs would probably not work very well, at least as a batch job: I don't think you can send a timestamp via the Measurement Protocol, so it would look as if all hits occurred at the same time. But it shouldn't be necessary anyway; just create a URL with the relevant parameters pointing to the Google endpoint and send it in the background via cURL (or similar).
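A minimal sketch of such a background hit; the tracking ID and hostname are placeholders:

    <?php
    $payload = http_build_query([
        'v'   => 1,                    // protocol version
        'tid' => 'UA-XXXXXXXX-1',      // your Universal Analytics tracking ID
        'cid' => '555',                // anonymous client ID
        't'   => 'pageview',           // hit type
        'dh'  => 'example.com',        // document hostname
        'dp'  => $_SERVER['REQUEST_URI'] ?? '/',
    ]);
    $ch = curl_init('https://www.google-analytics.com/collect');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
    curl_exec($ch);
    curl_close($ch);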
If you're in the European Union, remember that privacy guidelines still apply: you need to inform users and provide an opt-out link.
For non-Universal Analytics accounts, you can use php-ga - Server-Side Google Analytics Client -- it's essentially a server-side implementation of ga.js.
One caveat: if you want the location metrics to record something other than the location of your server, you'll need to log with a Google Analytics mobile tracking ID. Just replace the "UA" in the tracking ID with "MO", e.g. "MO-12345678-1".
GA needs JavaScript, I think, so that it can grab various things like screen resolution. So even if this were possible, you'd be missing a good deal of info for some of your users, skewing your other percentages. Also, if your users are suspicious of Google, they probably would not want you to upload their IP addresses to GA.
With all that in mind, I wonder whether a self-hosted GA-like system would fit the bill? If so, try Piwik.
Are there any (free) speech-to-text API's that I could use with PHP? (I only know PHP and html/css.)
I'd like to send it an audio file, then have it return the transcription.
I haven't found any free APIs but there are a few relatively inexpensive ones:
Quicktate
PhoneTag
Twilio
The first two let you supply an MP3, whereas Twilio (which has the best rates) gets input through their own system, so your choice will depend on your application.
(You'll have to Google PhoneTag and Twilio; I can't post more than one link at my current reputation.)
Voice recognition is computationally rather expensive; it's definitely not the kind of project you'd implement in PHP. On the other hand, you might create a web-based interface, or integrate it into a web or IVR type application, using PHP as the glue (the voice search on Android is very cool).
So although there are some off-the-shelf toolkits available, you're probably going to be writing a lot of C code to do anything interesting with them. How you get on also depends a lot on the OS you are using (not stated; the example link is just the first hit from Google).
Dynaspeak from SRI could work.
Once a day we want to download Google Finance data for 6,100+ stock symbols.
Right now we go to this URL and fetch the data for each stock symbol:
http://www.google.com/finance?q=NYSE:AA&fstype=ii
Getting the data like this uses a lot of bandwidth and slows down the server.
Is there a better way to get the data from Google?
There is a Google Finance API, but it is not available in PHP.
The Google Finance API is available to all web scripting languages. The API is simply a description of, and instructions for using, their REST service. In PHP you use something like cURL to call the REST service, retrieve the results it outputs in XML format, and then parse the XML to display, store, or otherwise use the retrieved information.
Note that even though they don't provide an example for PHP as they do for many of their APIs, all their APIs use the same sort of system, so the examples provided for something like the Google Spreadsheets Data API or the Google Documents List API are a valid starting point for the Google Finance API. The differences between them are the parameters passed, the data returned, and the URL you call with those parameters.
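As a rough illustration of that cURL-plus-XML pattern, here is a sketch against the Finance GData portfolios feed. The auth token must be obtained separately, and the exact URL and headers should be checked against the current docs; treat all of this as an assumption rather than a working recipe.

    <?php
    $ch = curl_init('https://finance.google.com/finance/feeds/default/portfolios');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        'Authorization: GoogleLogin auth=YOUR_AUTH_TOKEN', // placeholder token
    ]);
    $response = curl_exec($ch);
    curl_close($ch);

    $feed = simplexml_load_string($response);  // Atom feed
    foreach ($feed->entry as $entry) {
        echo $entry->title . "\n";             // one entry per portfolio
    }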
If you can live without the data coming from Google, have you looked at the Yahoo Finance API? It's pretty flexible and allows the downloading of multiple symbols at once (though you may not necessarily want to do all 6100 at once).
For example, you can do:
http://finance.yahoo.com/d/quotes.csv?s=XOM+BBDb.TO+JNJ+MSFT&f=snd1l1yr
More detail on how to use the API is nicely written up at:
http://www.gummy-stuff.org/Yahoo-data.htm
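For example, a minimal PHP fetch-and-parse of that CSV feed might look like the sketch below; the f= flags (s=symbol, n=name, d1=last trade date, l1=last trade price, y=yield, r=P/E) follow the write-up above. For 6,100 symbols you would batch the list (e.g. with array_chunk) rather than make one giant request.

    <?php
    $symbols = ['XOM', 'BBDb.TO', 'JNJ', 'MSFT'];
    $url = 'http://finance.yahoo.com/d/quotes.csv?s=' . implode('+', $symbols)
         . '&f=snd1l1yr';
    $csv = file_get_contents($url);

    foreach (explode("\n", trim($csv)) as $line) {
        // Columns come back in the order of the f= flags.
        list($symbol, $name, $date, $last, $yield, $pe) = str_getcsv($line);
        echo "$symbol ($name): $last on $date\n";
    }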