Google Scholar profile scrape PHP

I would like to scrape publications from a Google Scholar profile with SimpleHtmlDom.
I have a script for scraping the publications, but the problem is that I can only scrape the ones that are shown on the page.
When I use a URL like this
$html->load_file("http://scholar.google.se/citations?user=Sx4G9YgAAAAJ");
only 20 publications are shown. I can increase the number by changing the URL
$html->load_file("https://scholar.google.se/citations?user=Sx4G9YgAAAAJ&hl=&view_op=list_works&pagesize=100");
and setting the "pagesize" parameter. But 100 is the maximum number of publications the page will show.
Is there a way to scrape all the publications from a profile?

You cannot get all of the publications at once, but you can get 100 at a time, then another 100, and so on. Here is the URL:
https://scholar.google.com/citations?user=Sx4G9YgAAAAJ&hl=&view_op=list_works&cstart=100&pagesize=100
In the above URL, focus on the cstart parameter. Say you have already grabbed 100 publications: you now set cstart=100 to grab the next 100, then cstart=200, and so on until you have all of the publications.
Hope this helps

You have to pass an additional pagination parameter in the request URL.
cstart - Parameter defines the result offset. It skips the given number of results. It's used for pagination. (e.g., 0 (default) is the first page of results, 20 is the 2nd page of results, 40 is the 3rd page of results, etc.).
pagesize - Parameter defines the number of results to return. (e.g., 20 (default) returns 20 results, 40 returns 40 results, etc.). Maximum number of results to return is 100.
So, your URL should look like this:
https://scholar.google.com/citations?user=WLBAYWAAAAAJ&hl=en&cstart=100&pagesize=100
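The cstart/pagesize paging described above can be sketched as a loop. This is only a minimal sketch: scholarPageUrl and fetchAllPublications are hypothetical helper names, and $fetchPage is an assumed callback that downloads and parses one page (for example with SimpleHtmlDom) and returns an array of publications. The loop stops as soon as a page comes back with fewer rows than the page size.

```php
<?php
// Build the URL for one page of a Scholar profile's publication list.
function scholarPageUrl(string $user, int $cstart, int $pageSize = 100): string
{
    return 'https://scholar.google.com/citations?' . http_build_query([
        'user'     => $user,
        'hl'       => 'en',
        'view_op'  => 'list_works',
        'cstart'   => $cstart,
        'pagesize' => $pageSize,
    ]);
}

// Collect publications page by page. $fetchPage is a hypothetical callback:
// it receives a URL and returns the publications found on that page.
function fetchAllPublications(string $user, callable $fetchPage, int $pageSize = 100): array
{
    $all = [];
    $cstart = 0;
    do {
        $rows = $fetchPage(scholarPageUrl($user, $cstart, $pageSize));
        $all = array_merge($all, $rows);
        $cstart += $pageSize;
    } while (count($rows) === $pageSize); // a short page means the end was reached
    return $all;
}
```

Passing the page fetcher in as a callback keeps the paging logic separate from the HTML parsing, so the loop can be tried out with a stub before any real request is made.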
You could also use a third-party solution like SerpApi to do this for you. It's a paid API with a free trial.
Example PHP code (libraries for other languages are also available) to retrieve the second page of results:
require 'path/to/google_search_results';
$query = [
"api_key" => "secret_api_key",
"engine" => "google_scholar_author",
"hl" => "en",
"author_id" => "WLBAYWAAAAAJ",
"num" => "100",
"start" => "100"
];
$search = new GoogleSearch();
$results = $search->json($query);
Example JSON output:
"articles": [
{
"title": "Geographic localization of knowledge spillovers as evidenced by patent citations",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=WLBAYWAAAAAJ&cstart=100&pagesize=100&citation_for_view=WLBAYWAAAAAJ:HGTzPopzzJcC",
"citation_id": "WLBAYWAAAAAJ:HGTzPopzzJcC",
"authors": "AB Jaffe, M Trajtenberg, R Henderson",
"publication": "Patents, citations, and innovations: a window on the knowledge economy, 155-178, 2002",
"cited_by": {
"value": 18,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=8561816228378857607",
"serpapi_link": "https://serpapi.com/search.json?cites=8561816228378857607&engine=google_scholar&hl=en",
"cites_id": "8561816228378857607"
},
"year": "2002"
},
{
"title": "IPR, innovation, economic growth and development",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=WLBAYWAAAAAJ&cstart=100&pagesize=100&citation_for_view=WLBAYWAAAAAJ:70eg2SAEIzsC",
"citation_id": "WLBAYWAAAAAJ:70eg2SAEIzsC",
"authors": "AGZ Hu, AB Jaffe",
"publication": "Department of Economics, National University of Singapore, 2007",
"cited_by": {
"value": 17,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=7886734392494692167",
"serpapi_link": "https://serpapi.com/search.json?cites=7886734392494692167&engine=google_scholar&hl=en",
"cites_id": "7886734392494692167"
},
"year": "2007"
},
...
]
Check out the documentation for more details.
Disclaimer: I work at SerpApi.

Related

Google Places API photo references

I am using the Google Places API to get photos for a location.
The call works correctly, but I have a question: can a Google photo reference change?
I want to save the photo reference for each place ID, so I need to know whether the reference will change in the future.
"places": [
{
"place_id": "ChIJheBKaKGuEmsRKk48IGMVojU",
"name": "Thai Spice House",
"lon": 151.228688,
"lat": -33.82916,
"address": "271 Military Road, Cremorne NSW 2090, Australia",
"images": [
"CnRoAAAAbMcJPxxWzU1pj_zSHqMtLlLBe2o6_pmd2ZHhJdxBO3UG4Q1BYxr4x834Bp5UmDrZhmSxVzeXb-nqHIYqLWcTdjQFFnuvp_DgK7c59wEvnu_AkH3KLNpqm4BtFw5wTWeZOgmwNnTEEoevb5-AxfsipxIQ9TnIAApazfKw1KuO7ZtEMhoUWs9FAN78M3O26af9StPMx3fej5E"
]
Can this image reference change in the future?

Every value except place_id can potentially change at any time.

How to retrieve biographical information of a person using Wikipedia's web API?

I am working on retrieving some particular bio details of a person from a Wikipedia page of that person through Wikipedia's web API.
I need to retrieve the bio information box of a person.
I found how to retrieve the content box, the introduction paragraph and so on. The URL below retrieves the first introductory paragraph of the wiki page.
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Sachin_Tendulkar
But I am stuck with getting the above bio information box through wiki web API, so that I could extract the specific details I want.
Is it possible to get a single item of information like only the full name or only the date of birth through a single query (instead of getting the whole information and extracting the details from it)?

Simple: you should not extract biographical data from Wikipedia directly, but from its structured data counterpart, Wikidata. See https://www.wikidata.org/wiki/Wikidata:Data_access for how.
For example, date of birth is property P569; for the entity Q42 (Douglas Adams) the query is https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=Q42&property=P569
{
"claims": {
"P569": [
{
"id": "q42$D8404CDA-25E4-4334-AF13-A3290BCD9C0F",
"mainsnak": {
"snaktype": "value",
"property": "P569",
"datatype": "time",
"datavalue": {
"value": {
"time": "+1952-03-11T00:00:00Z",
"timezone": 0,
"before": 0,
"after": 0,
"precision": 11,
"calendarmodel": "http://www.wikidata.org/entity/Q1985727"
},
"type": "time"
}
},
etc.
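To pull a single value such as the date of birth out of a response shaped like the one above, something along these lines may work. It is a sketch only: claimTimes is a hypothetical helper, the inline JSON is a trimmed copy of the sample response, and a real request would also need format=json to get machine-readable output.

```php
<?php
// Return the raw time strings stored for one property in a decoded
// wbgetclaims response (hypothetical helper, not part of any library).
function claimTimes(array $response, string $property): array
{
    $times = [];
    foreach ($response['claims'][$property] ?? [] as $claim) {
        $snak = $claim['mainsnak'] ?? [];
        // Only snaks of type "value" carry a datavalue; skip the others.
        if (($snak['snaktype'] ?? '') !== 'value') {
            continue;
        }
        $times[] = $snak['datavalue']['value']['time'];
    }
    return $times;
}

// Trimmed copy of the sample response shown above.
$json = '{"claims":{"P569":[{"mainsnak":{"snaktype":"value","property":"P569",'
      . '"datatype":"time","datavalue":{"value":{"time":"+1952-03-11T00:00:00Z",'
      . '"precision":11},"type":"time"}}}]}}';

$times = claimTimes(json_decode($json, true), 'P569');
echo $times[0], "\n"; // prints +1952-03-11T00:00:00Z
```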

Finding all pages containing images in Wikimedia Commons category via API

I'm currently trying to find all the pages where images/media from a particular category are being used on Wikimedia Commons.
Using the API, I can list all the images with no problem, but I'm struggling to make the query add in all the pages where the items are used.
Here is an example category with only two media images
https://commons.wikimedia.org/wiki/Category:Automobiles
Here is the API call I am using
https://commons.wikimedia.org/w/api.php?action=query&prop=images&format=json&generator=categorymembers&gcmtitle=Category%3AAutomobiles&gcmprop=title&gcmnamespace=6&gcmlimit=200&gcmsort=sortkey
The long-term aim is to find all the pages the images from our collections appear on and then get all the tags from those pages about the images. We can then use this to enhance our archive of information about those images and hopefully use linked data to find relevant images we may not know about from DBpedia.
I might have to do two queries, first get the images then request info about each page, but I was hoping to do it all in one call.

Assuming that you don't need to recurse into subcategories, you can just use a prop=globalusage query with generator=categorymembers, e.g. like this:
https://commons.wikimedia.org/w/api.php?action=query&prop=globalusage&generator=categorymembers&gcmtitle=Category:Images_from_the_German_Federal_Archive&gcmtype=file&gcmlimit=200&continue=
The output, in JSON format, will look something like this:
// ...snip...
"6197351": {
"pageid": 6197351,
"ns": 6,
"title": "File:-Bundesarchiv Bild 183-1987-1225-004, Schwerin, Thronsaal-demo.jpg",
"globalusage": [
{
"title": "Wikipedia:Fotowerkstatt/Archiv/2009/M\u00e4rz",
"wiki": "de.wikipedia.org",
"url": "https://de.wikipedia.org/wiki/Wikipedia:Fotowerkstatt/Archiv/2009/M%C3%A4rz"
}
]
},
"6428927": {
"pageid": 6428927,
"ns": 6,
"title": "File:-Fernsehstudio-Journalistengespraech-crop.jpg",
"globalusage": [
{
"title": "Kurt_von_Gleichen-Ru\u00dfwurm",
"wiki": "de.wikipedia.org",
"url": "https://de.wikipedia.org/wiki/Kurt_von_Gleichen-Ru%C3%9Fwurm"
},
{
"title": "Wikipedia:Fotowerkstatt/Archiv/2009/April",
"wiki": "de.wikipedia.org",
"url": "https://de.wikipedia.org/wiki/Wikipedia:Fotowerkstatt/Archiv/2009/April"
}
]
},
// ...snip...
Note that you will very likely have to deal with query continuations, since there may easily be more results than MediaWiki will return in a single request. See the linked page for more information on handling those (or just use an MW API client that handles them for you).
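A continuation loop for a query like the one above could look roughly like this. It is a sketch under assumptions: queryAllGlobalUsage is a hypothetical helper, and $request is an assumed callback that sends the parameters to api.php (with format=json) and returns the decoded array. It relies on the API returning a "continue" object whose values are merged into the next request until no such object is returned.

```php
<?php
// Run a globalusage query to completion. $request is a hypothetical
// callback: it takes the query parameters and returns the decoded response.
function queryAllGlobalUsage(callable $request, array $params): array
{
    $usage = [];
    $continue = [];
    do {
        // Values from the previous "continue" object override the originals.
        $response = $request(array_merge($params, $continue));
        foreach ($response['query']['pages'] ?? [] as $page) {
            $title = $page['title'];
            // One file's usage list can be split across several responses.
            $usage[$title] = array_merge($usage[$title] ?? [], $page['globalusage'] ?? []);
        }
        // No "continue" object means the result set is exhausted.
        $continue = $response['continue'] ?? [];
    } while ($continue !== []);
    return $usage;
}
```

The result is keyed by file title, with each entry holding the merged list of pages that use the file.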

I don't understand your use case ("our collections"?) so I don't know why you want to use the API directly, but if you want to recurse in categories you're going to do a lot of wheel reinvention.
Most people use the tools made by Magnus Manske, creator of MediaWiki: in this case it's GLAMorous. Example with 3 levels of recursion (finds 186k images, 114k usages): https://tools.wmflabs.org/glamtools/glamorous.php?doit=1&category=Automobiles&use_globalusage=1&depth=3
Results can also be downloaded in XML format, so it's machine-readable.

Google Cloud Printing and Capabilities PPD

We are having some success printing via Google's Cloud Print service, but we are wondering if anyone has information about the capabilities parameter used when submitting a print job, and some pointers on how to create and work with this format, which I believe is PPD.
We have been able to get the capabilities of the printer using the method http://www.google.com/cloudprint/printer, which returns all the values for our printer. The problem is that we don't quite understand what we are meant to do with them in order to define the options we would like to print with. These would include the number of copies, paper type and print quality. An example of the capabilities information we receive looks like this:
{
"name": "copies",
"displayName": "Copies",
"type": "ParameterDef"
}
{
"UIType": "PickOne",
"name": "HPEconoMode",
"displayName": "EconoMode",
"type": "Feature",
"options": [
{
"ppd:value": "\"\"",
"default": true,
"name": "PrinterDefault",
"displayName": "Printer's Current Setting"
},
{
"ppd:value": "\u003c\u003c/EconoMode true\u003e\u003e setpagedevice",
"name": "True",
"displayName": "Save Toner"
},
{
"ppd:value": "\u003c\u003c/EconoMode false\u003e\u003e setpagedevice",
"name": "False",
"displayName": "Highest Quality"
}
]
}

The GCP documentation is badly lacking in this regard. Anyway, I've managed to find that the correct parameter to send printer settings is ticket, not capabilities. The first part of the parameters corresponds to the basic settings from the print dialog and they are quite self-explanatory and the values are easy to change. The vendor_ticket_item array is a bit more complicated. It contains id/value pairs described by the printer capabilities. The id will contain the name of the parameter from the capabilities and the value will contain the name of one of the records in the parameter options, or a numeric value etc, as described in the capabilities.
For more details please take a look at my full solution.
{
"version":"1.0",
"print":{
"color":{"vendor_id":"psk:Color","type":0},
"duplex":{"type":0},
"page_orientation":{"type":1},
"copies":{"copies":1},
"dpi":{"horizontal_dpi":600,"vertical_dpi":600},
"media_size":{"width_microns":148000,"height_microns":210000,"is_continuous_feed":false},
"collate":{"collate":true},
"vendor_ticket_item":[
//Printer specific settings here, from the capabilities:
{"id":"psk:JobInputBin","value":"ns0000:Tray3"},
{"id":"psk:PageICMRenderingIntent","value":"psk:Photographs"},
{"id":"psk:PageMediaType","value":"ns0000:Auto"},
{"id":"psk:JobOutputBin","value":"ns0000:Auto"},
//etc.
]
}
}
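Building such a ticket in PHP might look like the sketch below. buildTicket is a hypothetical helper, and only the copies setting and the vendor_ticket_item list are filled in. The id/value pair used here ("HPEconoMode" / "True") is taken from the capabilities sample in the question, on the assumption that the id is the capability name and the value is the name of the chosen option.

```php
<?php
// Build a JSON print ticket. $vendorOptions maps a capability name to the
// name of the option chosen for it (hypothetical helper for illustration).
function buildTicket(int $copies, array $vendorOptions): string
{
    $items = [];
    foreach ($vendorOptions as $id => $value) {
        $items[] = ['id' => $id, 'value' => $value];
    }
    return json_encode([
        'version' => '1.0',
        'print' => [
            'copies' => ['copies' => $copies],
            'vendor_ticket_item' => $items,
        ],
    ]);
}

$ticket = buildTicket(2, ['HPEconoMode' => 'True']);
echo $ticket, "\n";
// prints {"version":"1.0","print":{"copies":{"copies":2},"vendor_ticket_item":[{"id":"HPEconoMode","value":"True"}]}}
```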

How to recognise adult content programmatically?

I am currently developing a website for a client. It consists of users being able to upload pictures to be shown in a gallery on the site.
The problem we have is that when a user uploads an image it would obviously need to be verified to make sure it is safe for the website (no pornographic or explicit pictures). However my client would not like to manually have to accept every image that is being uploaded as this would be time consuming and the users' images would not instantly be online.
I am writing my code in PHP. If needs be I could change to ASP.net or C#. Is there any way that this can be done?

2019 Update
A lot has changed since the original answer back in 2013, the main thing being machine learning. There are now a number of libraries and APIs available for programmatically detecting adult content:
Google Cloud Vision API, which uses the same models Google uses for safe search.
NSFWJS, which uses TensorFlow.js, claims to achieve ~90% accuracy and is open source under the MIT license.
Open NSFW, a solution from Yahoo released under the BSD 2-clause license.
2013 Answer
There is a JavaScript library called nude.js which is for this, although I have never used it. Here is a demo of it in use.
There is also PORNsweeper.
Another option is to "outsource" the moderation work using something like Amazon Mechanical Turk, which is a crowdsourced platform which "enables computer programs to co-ordinate the use of human intelligence to perform tasks which computers are unable to do". So you would basically pay a small amount per moderation item and have an outsourced actual human to moderate the content for you.
The only other solution I can think of is to make the images user moderated, where users can flag inappropriate posts/images for moderation, and if nobody wants to manually moderate them they can simply be removed after a certain number of flags.
Here are a few other interesting links on the topic:
http://thomas.deselaers.de/publications/papers/deselaers_icpr08_porn.pdf
http://www.naun.org/multimedia/NAUN/computers/20-462.pdf
What is the best way to programmatically detect porn images?

The example below does not give you 100% accurate results, but it should help you at least a bit and works out of the box.
<?php
$url = 'http://server.com/image.png';
// URL-encode the image address so it survives as a query parameter.
$data = json_decode(file_get_contents('http://api.rest7.com/v1/detect_nudity.php?url=' . urlencode($url)));
// @ suppresses the notice raised when the response could not be decoded.
if (@$data->success !== 1)
{
die('Failed');
}
echo 'Contains nudity? ' . $data->nudity . '<br>';
echo 'Nudity percentage: ' . $data->nudity_percentage . '<br>';

If you are looking for an API-based solution, you may want to check out Sightengine.com
It's an automated solution to detect things like adult content, violence, celebrities etc in images and videos.
Here is an example in PHP, using the SDK:
<?php
$client = new SightengineClient('YourApplicationID', 'YourAPIKey');
$output = $client->check('nudity')->image('https://sightengine.com/assets/img/examples/example2.jpg');
The output will then return the classification:
{
"status": "success",
"request": {
"id": "req_VjyxevVQYXQZ1HMbnwtn",
"timestamp": 1471762434.0244,
"operations": 1
},
"nudity": {
"raw": 0.000757,
"partial": 0.000763,
"safe": 0.999243
},
"media": {
"id": "med_KWmB2GQZ29N4MVpVdq5K",
"uri": "https://sightengine.com/assets/img/examples/example2.jpg"
}
}
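One way to act on a response like the one above is to compare the "safe" score with a threshold before publishing the image. This is a sketch only: isSafeImage is a hypothetical helper, it works on the response decoded into an array, and the 0.9 threshold is an arbitrary assumption that would need tuning for a real site.

```php
<?php
// Decide whether an image may be published, given a decoded nudity response.
// The default threshold of 0.9 is an assumption, not a recommended value.
function isSafeImage(array $output, float $minSafe = 0.9): bool
{
    if (($output['status'] ?? '') !== 'success') {
        return false; // treat failed checks as unsafe and leave them for review
    }
    return ($output['nudity']['safe'] ?? 0.0) >= $minSafe;
}

// Trimmed copy of the sample response shown above.
$sample = json_decode(
    '{"status":"success","nudity":{"raw":0.000757,"partial":0.000763,"safe":0.999243}}',
    true
);
var_dump(isSafeImage($sample)); // bool(true)
```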
Have a look at the documentation for more details: https://sightengine.com/docs/#nudity-detection
(disclaimer: I work there)

There is a free API that detects adult content (porn, nudity, NSFW).
https://market.mashape.com/purelabs/sensitive-image-detection
We've been using it in our production environment and I would say it works pretty well so far. There are some false detections though; it seems to prefer marking an image as unsafe when it is unsure.

It all depends on the level of accuracy you are looking for. Simple skin tone detection (like nude.js) will probably get you 60-80% accuracy on a generous sample set. For anything more accurate than that, say 90-95%, you are going to need a specialized computer vision system with an evolving model that is revised over time. For the latter you might want to check out http://clarifai.com or https://scanii.com (which I work on).

Microsoft Azure has a very cool API called Computer Vision, which you can use for free (either through the UI or programmatically) and has tons of documentation, including for PHP.
It has some amazingly accurate (and sometimes humorous) results.
Outside of detecting adult and "racy" material, it will read text, guess your age, identify primary colours, etc etc.
You can try it out at azure.microsoft.com.
Sample output from a "racy" image:
FEATURE NAME: VALUE:
Description { "tags": [ "person", "man", "young", "woman", "holding",
"surfing", "board", "hair", "laying", "boy", "standing",
"water", "cutting", "white", "beach", "people", "bed" ],
"captions": [ { "text": "a man and a woman taking a selfie",
"confidence": 0.133149087 } ] }
Tags [ { "name": "person", "confidence": 0.9997446 },
{ "name": "man", "confidence": 0.9587285 },
{ "name": "wall", "confidence": 0.9546831 },
{ "name": "swimsuit", "confidence": 0.499717563 } ]
Image format "Jpeg"
Image dimensions 1328 x 2000
Clip art type 0
Line drawing type 0
Black and white false
Adult content true
Adult score 0.9845981
Racy true
Racy score 0.964191854
Categories [ { "name": "people_baby", "score": 0.4921875 } ]
Faces [ { "age": 37, "gender": "Female",
"faceRectangle": { "top": 317, "left": 1554,
"width": 232, "height": 232 } } ]
Dominant color background "Brown"
Dominant color foreground "Black"
Accent Color #0D8CBE
