I've been having trouble scraping content from the following website: http://www.qe.com.qa/wp/mw/MarketWatch.php
Using file_get_contents() never gets me the right tag. I would like to scrape the content of the following tag: td aria-describedby="grid_OfferPrice"
Is the website protected from scraping? When I try the same method with different websites it works. If it is protected, what is a good workaround?
The way to see if scraping works is to output what file_get_contents() returns. If you get nothing back, or an error, then your IP may have been restricted by their admin.
If it returns their source code then it's working, but perhaps the tag you're looking for doesn't exist.
Eliminate failures in your process by answering these questions first, one at a time.
I viewed their source code and the aria attribute you are searching for doesn't appear to exist.
It seems they load the data on that page from another source which is at this page (http://www.qe.com.qa/wp/mw/bg/ReadData.php?Types=SELECTED&iType=SO&dummy=1401401577192&_search=false&nd=1401401577279&rows=100&page=1&sidx=&sord=asc)
If you want the data from that page then use file_get_contents on it directly.
Pasting the data from that page into an online JSON editor gives you a neat way of quickly seeing whether this is a useful solution for you.
A sample of that data is listed below:
{
    "total": "140",
    "page": "1",
    "records": "140",
    "rows": [
        {
            "Topic": "QNBK/NM",
            "Symbol": "QNBK",
            "CompanyEN": "QNB",
            "CompanyAR": "QNB",
            "Trend": "-",
            "StateEN": "Tradeable",
            "StateAR": "المتداوله",
            "CatEN": "Listed Companies",
            "CatAR": "الشركات المدرجة",
            "SectorEN": "Banks & Financial Services",
            "SectorAR": "البنوك والخدمات المالية",
            "ShariahEN": "N/A",
            "ShariahAR": "N/A",
            "OfferVolume": "7503",
            "OfferPrice": "184.00",
            "BidPrice": "182.00",
            "BidVolume": "15807",
            "OpenPrice": "190.0",
            "High": "191.7",
            "Low": "181.0",
            "IMP": "182.0",
            "LastPrice": "182.0",
            "PrevClosing": "187.0",
            "Change": "-5.0",
            "PercentChange": "-2.6737",
            "Trades": "980",
            "Volume": "2588830",
            "W52High": "199.0",
            "W52Low": "145.0",
            "Value": "481813446.4"
        },
        {
            "Topic": "QIBK/NM",
            "Symbol": "QIBK",
            "CompanyEN": "Qatar Islamic Bank",
            "CompanyAR": "المصرف ",
            "Trend": "+",
            "StateEN": ...
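If you go the JSON route, reading a value out of that feed is straightforward with json_decode(). Below is a minimal sketch; offerPriceFor is just an illustrative helper name, and the sample string is a trimmed copy of the feed's structure (in practice $json would come from file_get_contents() on the ReadData.php URL above):

```php
<?php
// Illustrative helper: pull the OfferPrice for a given symbol out of the
// feed's decoded JSON. Returns null when the symbol is not listed.
function offerPriceFor(string $json, string $symbol): ?string {
    $data = json_decode($json, true);
    foreach ($data['rows'] ?? [] as $row) {
        if ($row['Symbol'] === $symbol) {
            return $row['OfferPrice'];
        }
    }
    return null;
}

// Usage with a trimmed sample of the feed:
$json = '{"rows":[{"Symbol":"QNBK","OfferPrice":"184.00"}]}';
echo offerPriceFor($json, 'QNBK'); // prints 184.00
```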
Make sure you read this link about 'scraping' etiquette.
Link: http://simplehtmldom.sourceforge.net/
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents("EXAMPLE.COM"));
$items = $dom->getElementsByTagName("YOUR TAG");
This class allows you to search HTML code for elements. I have used it a few times before and it is by far the best solution I have found for your issue.
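If the page you fetch did contain that attribute, DOMXPath (part of PHP's DOM extension, which the code above already uses) can select it directly instead of walking tags by name. A sketch against inline sample markup:

```php
<?php
// Select a cell by its aria-describedby attribute using an XPath query.
// The sample markup here stands in for the HTML you would fetch.
$html = '<table><tr><td aria-describedby="grid_OfferPrice">184.00</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$cells = $xpath->query('//td[@aria-describedby="grid_OfferPrice"]');

foreach ($cells as $cell) {
    echo $cell->textContent; // prints 184.00
}
```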
I've been using CloudKit web services very successfully, but I am stumped about how to create a CKReference.
I've looked at the docs here regarding Reference Dictionaries, but can't make such a dictionary work.
My php generates the following operations dictionary:
{"operations":[
{"operationType": "create",
"record": { "recordType": "Works",
"fields": {
"type":{"value":"Painting"},
"title": {"value":"test"},
"date": {"value":"10/29/1965"},
"height": {"value":"21"},
"length": {"value":"21"},
"width": {"value":"21"},
"runningTime": {"value":""},
"materials": {"value":"test"},
"description":{"value":"test"},
"saleStatus": {"value":"yes"},
"tos":{"value":"yes"},
"artist": {"value":"Peter Wiley"},
"artistRecordName":{"value":"286CB3BF-69CC-4DD3-9233-CC80E5FA95D4"},
"artistRecordRef": {
"recordName": {"value":"286CB3BF-69CC-4DD3-9233-CC80E5FA95D4"},
"zoneID":{"zoneName": {"value":"_defaultZone"}},
"action": {"value":"NONE"}
},
"subject":{"value":""},
"metaType":{"value":"Fine Art"},
"userRecordName":{"value":"30C54AD8-3701-428C-99B7-0393DD2DAB45"},
"userRole":{"value":"Artist"},
"status":{"value":"P"}
}
} }
]}
This request returns the error:
BAD_REQUEST" [1]=> string(62) "BadRequestException: Unexpected input
at [line: 26, column: 3]
If I remove the "artistRecordRef" the request works as it should.
I am sure the answer is obvious to a more experienced eye. Can someone see what's wrong?
OK, I found the answer here, but have posted for others who may have the question because the answer was not easy to find.
This is what works:
"artistRecordRef": {"value": {
"recordName": "'.$artistRecordName.'",
"action": "NONE"
}
},
The Reference Dictionary has to be passed as a value. I didn't get this and it's not well documented with examples in the Apple docs (at least in those I was able to find).
See: How can I use CloudKit web services to query based on a reference field?
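For anyone generating this JSON from PHP, building the body as a nested array and json_encode-ing it avoids the quoting mistakes that string concatenation invites. A sketch with only a couple of fields shown (the record name is the placeholder value from the question):

```php
<?php
// Build the operations body as arrays so the reference lands under "value",
// then let json_encode produce correctly quoted JSON.
$artistRecordName = '286CB3BF-69CC-4DD3-9233-CC80E5FA95D4';

$fields = [
    'title' => ['value' => 'test'],
    'artistRecordRef' => ['value' => [
        'recordName' => $artistRecordName,
        'action'     => 'NONE',
    ]],
];

$body = ['operations' => [[
    'operationType' => 'create',
    'record' => [
        'recordType' => 'Works',
        'fields'     => $fields,
    ],
]]];

echo json_encode($body);
```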
If we have a link to another OneNote page in the HTML content:
<a href="onenote:SectionB.one#Note1&section-id={<section-id>}&page-id={<page-id>}&end&base-path=https://<path>"
... before I write a parsing routine to extract that link, I thought I'd ask if I'd overlooked anything in the OneNote API to make this easier.
===========================================================================
[EDIT] Well, I've written my routine to extract the page-id of the linked note, but that page-id turns out to be quite different from the page-id that's returned as a property (id) of the linked note itself - and it doesn't work :(
Here's an example:
(1) page-id extracted from link: A8CECE6F-6AD8-4680-9773-6C01E96C91D0
(2) page-id as property of note:
0-5f49903893f048d0a3b1893ef004411f!1-240BD74C83900C17!124435
Vastly different, as you see. Accessing the page content via:
../pages/{page-id}/content
... for (1) returns nothing
... for (2) returns the full page content.
(The section-ids returned by both methods are also entirely different.)
So, how can I extract from the link a page-id that works?
Unfortunately, the OneNote API currently does not support identifying links to other OneNote pages in page content. Links in OneNote can be links to anything: websites, other OneNote pages/sections/notebooks, network shares...
The API does support getting links to pages by using
GET ~/pages
GET ~/sections/id/pages
The page metadata model contains a links object with the clientUrl and the webUrl.
Editing after your question update:
You're right - the id in the link does not correspond to the id of the OneNote API. You can however compare the id in the link with the id in the OneNoteClientUrl exposed in the API. Here's an example of the response of a
GET ~/sections/id/pages
GET ~/pages
{
"title": "Created from WAC",
"createdByAppId": "",
"links": {
"oneNoteClientUrl": {
"href": "onenote:https://d.docs.live.net/29056cf89bb2d216/Documents/TestingNotification/Harrie%27s%20Section.one#Created%20from%20WAC&section-id=49b630fa-26cd-43fa-9c45-5c62d547ee3d&page-id=a60de930-0b03-4527-bf54-09f3b61d8838&end"
},
"oneNoteWebUrl": {
"href": "https://onedrive.live.com/redir.aspx?cid=29056cf89bb2d216&page=edit&resid=29056CF89BB2D216!156&parId=29056CF89BB2D216!105&wd=target%28Harrie%27s%20Section.one%7C49b630fa-26cd-43fa-9c45-5c62d547ee3d%2FCreated%20from%20WAC%7Ca60de930-0b03-4527-bf54-09f3b61d8838%2F%29"
}
},
"contentUrl": "https://www.onenote.com/api/v1.0/me/notes/pages/0-a50842a9873945379f3d891a7420aa39!14-29056CF89BB2D216!162/content",
"thumbnailUrl": "https://www.onenote.com/api/v1.0/me/notes/pages/0-a50842a9873945379f3d891a7420aa39!14-29056CF89BB2D216!162/thumbnail",
"lastModifiedTime": "2016-03-28T21:36:22Z",
"id": "0-a50842a9873945379f3d891a7420aa39!14-29056CF89BB2D216!162",
"self": "https://www.onenote.com/api/v1.0/me/notes/pages/0-a50842a9873945379f3d891a7420aa39!14-29056CF89BB2D216!162",
"createdTime": "2016-03-24T20:38:16Z",
"parentSection#odata.context": "https://www.onenote.com/api/v1.0/$metadata#me/notes/pages('0-a50842a9873945379f3d891a7420aa39%2114-29056CF89BB2D216%21162')/parentSection(id,name,self)/$entity",
"parentSection": {
"id": "0-29056CF89BB2D216!162",
"name": "Harrie's Section",
"self": "https://www.onenote.com/api/v1.0/me/notes/sections/0-29056CF89BB2D216!162"
}
}
You can also filter server-side (if you want to save yourself the paging and regexes ;) ) for ids in the links by using:
GET ~/pages?$filter=contains(links/oneNoteClientUrl/href,'a60de930-0b03-4527-bf54-09f3b61d8838')
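If you do go the client-side route, extracting the page-id from an oneNoteClientUrl href is a one-line regex. A sketch (the helper name is made up; the sample href is shortened from the response above):

```php
<?php
// Pull the page-id out of an oneNoteClientUrl href so it can be matched
// against the id found in a OneNote link inside page content.
function pageIdFromClientUrl(string $href): ?string {
    return preg_match('/[?&#]page-id=([0-9a-fA-F-]+)/', $href, $m) ? $m[1] : null;
}

$href = 'onenote:https://d.docs.live.net/.../Section.one#Page'
      . '&section-id=49b630fa-26cd-43fa-9c45-5c62d547ee3d'
      . '&page-id=a60de930-0b03-4527-bf54-09f3b61d8838&end';

echo pageIdFromClientUrl($href); // prints a60de930-0b03-4527-bf54-09f3b61d8838
```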
I'm currently trying to find all the pages where images/media from a particular category are being used on Wikimedia Commons.
Using the API, I can list all the images with no problem, but I'm struggling to make the query add in all the pages where the items are used.
Here is an example category with only two media images
https://commons.wikimedia.org/wiki/Category:Automobiles
Here is the API call I am using
https://commons.wikimedia.org/w/api.php?action=query&prop=images&format=json&generator=categorymembers&gcmtitle=Category%3AAutomobiles&gcmprop=title&gcmnamespace=6&gcmlimit=200&gcmsort=sortkey
The long-term aim is to find all the pages the images from our collections appear on and then get all the tags from those pages about the images. We can then use this to enhance our archive of information about those images and hopefully use linked data to find relevant images we may not know about from DBpedia.
I might have to do two queries, first get the images then request info about each page, but I was hoping to do it all in one call.
Assuming that you don't need to recurse into subcategories, you can just use a prop=globalusage query with generator=categorymembers, e.g. like this:
https://commons.wikimedia.org/w/api.php?action=query&prop=globalusage&generator=categorymembers&gcmtitle=Category:Images_from_the_German_Federal_Archive&gcmtype=file&gcmlimit=200&continue=
The output, in JSON format, will look something like this:
// ...snip...
"6197351": {
"pageid": 6197351,
"ns": 6,
"title": "File:-Bundesarchiv Bild 183-1987-1225-004, Schwerin, Thronsaal-demo.jpg",
"globalusage": [
{
"title": "Wikipedia:Fotowerkstatt/Archiv/2009/M\u00e4rz",
"wiki": "de.wikipedia.org",
"url": "https://de.wikipedia.org/wiki/Wikipedia:Fotowerkstatt/Archiv/2009/M%C3%A4rz"
}
]
},
"6428927": {
"pageid": 6428927,
"ns": 6,
"title": "File:-Fernsehstudio-Journalistengespraech-crop.jpg",
"globalusage": [
{
"title": "Kurt_von_Gleichen-Ru\u00dfwurm",
"wiki": "de.wikipedia.org",
"url": "https://de.wikipedia.org/wiki/Kurt_von_Gleichen-Ru%C3%9Fwurm"
},
{
"title": "Wikipedia:Fotowerkstatt/Archiv/2009/April",
"wiki": "de.wikipedia.org",
"url": "https://de.wikipedia.org/wiki/Wikipedia:Fotowerkstatt/Archiv/2009/April"
}
]
},
// ...snip...
Note that you will very likely have to deal with query continuations, since there may easily be more results than MediaWiki will return in a single request. See the linked page for more information on handling those (or just use an MW API client that handles them for you).
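A continuation loop in PHP might look like the sketch below. The collectUsage helper (an illustrative name) just folds one decoded response into a results map; the loop re-issues the query with whatever the API returns under "continue", as the MediaWiki continuation docs describe:

```php
<?php
// Fold one API response into a map of file title => list of usage URLs.
function collectUsage(array $response, array $usage): array {
    foreach ($response['query']['pages'] ?? [] as $page) {
        foreach ($page['globalusage'] ?? [] as $use) {
            $usage[$page['title']][] = $use['url'];
        }
    }
    return $usage;
}

$params = [
    'action'    => 'query', 'format' => 'json',
    'prop'      => 'globalusage',
    'generator' => 'categorymembers',
    'gcmtitle'  => 'Category:Images_from_the_German_Federal_Archive',
    'gcmtype'   => 'file', 'gcmlimit' => '200',
];

$usage = [];
$continue = ['continue' => ''];  // empty "continue" starts the new-style continuation
do {
    $url = 'https://commons.wikimedia.org/w/api.php?' . http_build_query($params + $continue);
    $raw = @file_get_contents($url);
    if ($raw === false) {
        break; // network failure; bail out
    }
    $response = json_decode($raw, true);
    if (!is_array($response)) {
        break;
    }
    $usage = collectUsage($response, $usage);
    $continue = $response['continue'] ?? null; // null means no more pages
} while ($continue !== null);
```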
I don't understand your use case ("our collections"?), so I don't know why you want to use the API directly, but if you want to recurse into categories you're going to be reinventing a lot of wheels.
Most people use the tools made by Magnus Manske, creator of MediaWiki; in this case the relevant one is GLAMorous. Example with 3 levels of recursion (finds 186k images, 114k usages): https://tools.wmflabs.org/glamtools/glamorous.php?doit=1&category=Automobiles&use_globalusage=1&depth=3
Results can also be downloaded in XML format, so it's machine-readable.
We are having some success printing via Google's Cloud Print service, but we're wondering if anyone has information regarding the capabilities parameter when submitting a job to print, and some pointers on how to create and work with this format, which I believe is PPD.
We have been able to get the capabilities of the printer using the method http://www.google.com/cloudprint/printer, which returns all the values for our printer. The problem is we don't quite understand what we are meant to do with this in order to define the capability options we would like to print with. This would include options for the number of copies printed, paper type and print quality. An example of the capabilities information we receive is like this:
{
"name": "copies",
"displayName": "Copies",
"type": "ParameterDef"
}
{
"UIType": "PickOne",
"name": "HPEconoMode",
"displayName": "EconoMode",
"type": "Feature",
"options": [
{
"ppd:value": "\"\"",
"default": true,
"name": "PrinterDefault",
"displayName": "Printer's Current Setting"
},
{
"ppd:value": "\u003c\u003c/EconoMode true\u003e\u003e setpagedevice",
"name": "True",
"displayName": "Save Toner"
},
{
"ppd:value": "\u003c\u003c/EconoMode false\u003e\u003e setpagedevice",
"name": "False",
"displayName": "Highest Quality"
}
]
}
The GCP documentation is badly lacking in this regard. Anyway, I've managed to find that the correct parameter for sending printer settings is ticket, not capabilities. The first part of the ticket corresponds to the basic settings from the print dialog; these are quite self-explanatory and the values are easy to change. The vendor_ticket_item array is a bit more complicated: it contains id/value pairs described by the printer capabilities. The id contains the name of a parameter from the capabilities, and the value contains the name of one of the entries in that parameter's options, or a numeric value etc., as described in the capabilities.
For more details please take a look at my full solution.
{
"version":"1.0",
"print":{
"color":{"vendor_id":"psk:Color","type":0},
"duplex":{"type":0},
"page_orientation":{"type":1},
"copies":{"copies":1},
"dpi":{"horizontal_dpi":600,"vertical_dpi":600},
"media_size":{"width_microns":148000,"height_microns":210000,"is_continuous_feed":false},
"collate":{"collate":true},
"vendor_ticket_item":[
//Printer specific settings here, from the capabilities:
{"id":"psk:JobInputBin","value":"ns0000:Tray3"},
{"id":"psk:PageICMRenderingIntent","value":"psk:Photographs"},
{"id":"psk:PageMediaType","value":"ns0000:Auto"},
{"id":"psk:JobOutputBin","value":"ns0000:Auto"},
//etc.
]
}
}
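To actually send that ticket, you pass it (JSON-encoded) in the ticket field of the submit request. The sketch below follows the GCP submit interface as I understand it (printerid, title, ticket, content, contentType fields); the environment variables and file path are placeholders, and the live request only runs when credentials are provided:

```php
<?php
// Build the ticket as a PHP array and json_encode it into the submit request.
$ticket = [
    'version' => '1.0',
    'print' => [
        'copies' => ['copies' => 2],
        'dpi'    => ['horizontal_dpi' => 600, 'vertical_dpi' => 600],
        'vendor_ticket_item' => [
            // Printer-specific id/value pairs, taken from the capabilities:
            ['id' => 'psk:JobInputBin', 'value' => 'ns0000:Tray3'],
        ],
    ],
];

// Placeholders: supply a real printer id and OAuth token to actually submit.
$printerId   = getenv('GCP_PRINTER_ID');
$accessToken = getenv('GCP_ACCESS_TOKEN');

if ($printerId !== false && $accessToken !== false) {
    $post = [
        'printerid'   => $printerId,
        'title'       => 'Test job',
        'ticket'      => json_encode($ticket),
        'content'     => new CURLFile('/path/to/document.pdf'),
        'contentType' => 'application/pdf',
    ];
    $ch = curl_init('https://www.google.com/cloudprint/submit');
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['Authorization: Bearer ' . $accessToken]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = json_decode(curl_exec($ch), true);
}
```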
I am currently developing a website for a client. It consists of users being able to upload pictures to be shown in a gallery on the site.
The problem we have is that when a user uploads an image it would obviously need to be verified to make sure it is safe for the website (no pornographic or explicit pictures). However my client would not like to manually have to accept every image that is being uploaded as this would be time consuming and the users' images would not instantly be online.
I am writing my code in PHP. If needs be I could change to ASP.net or C#. Is there any way that this can be done?
2019 Update
A lot has changed since this original answer way back in 2013, the main thing being machine learning. There are now a number of libraries and API's available for programmatically detecting adult content:
Google Cloud Vision API, which uses the same models Google uses for safe search.
NSFWJS, which uses TensorFlow.js, claims to achieve ~90% accuracy and is open source under the MIT license.
Yahoo has a solution called Open NSFW under the BSD 2-Clause license.
2013 Answer
There is a JavaScript library called nude.js which is designed for this, although I have never used it. Here is a demo of it in use.
There is also PORNsweeper.
Another option is to "outsource" the moderation work using something like Amazon Mechanical Turk, which is a crowdsourced platform which "enables computer programs to co-ordinate the use of human intelligence to perform tasks which computers are unable to do". So you would basically pay a small amount per moderation item and have an outsourced actual human to moderate the content for you.
The only other solution I can think of is to make the images user moderated, where users can flag inappropriate posts/images for moderation, and if nobody wants to manually moderate them they can simply be removed after a certain number of flags.
Here are a few other interesting links on the topic:
http://thomas.deselaers.de/publications/papers/deselaers_icpr08_porn.pdf
http://www.naun.org/multimedia/NAUN/computers/20-462.pdf
What is the best way to programmatically detect porn images?
The example below does not give you 100% accurate results, but it should help you at least a bit, and it works out of the box.
<?php
$url = 'http://server.com/image.png';
$data = json_decode(file_get_contents('http://api.rest7.com/v1/detect_nudity.php?url=' . $url));
if (@$data->success !== 1)
{
die('Failed');
}
echo 'Contains nudity? ' . $data->nudity . '<br>';
echo 'Nudity percentage: ' . $data->nudity_percentage . '<br>';
If you are looking for an API-based solution, you may want to check out Sightengine.com
It's an automated solution to detect things like adult content, violence, celebrities etc in images and videos.
Here is an example in PHP, using the SDK:
<?php
$client = new SightengineClient('YourApplicationID', 'YourAPIKey');
$output = $client->check('nudity')->image('https://sightengine.com/assets/img/examples/example2.jpg');
The output will then return the classification:
{
"status": "success",
"request": {
"id": "req_VjyxevVQYXQZ1HMbnwtn",
"timestamp": 1471762434.0244,
"operations": 1
},
"nudity": {
"raw": 0.000757,
"partial": 0.000763,
"safe": 0.999243
},
"media": {
"id": "med_KWmB2GQZ29N4MVpVdq5K",
"uri": "https://sightengine.com/assets/img/examples/example2.jpg"
}
}
Have a look at the documentation for more details: https://sightengine.com/docs/#nudity-detection
(disclaimer: I work there)
There is a free API that detects adult content (porn, nudity, NSFW).
https://market.mashape.com/purelabs/sensitive-image-detection
We've been using it in our production environment and I would say it works pretty well so far. There are some false detections though; it seems they prefer to mark an image as unsafe when they are unsure.
It all depends on the level of accuracy you are looking for. Simple skin-tone detection (like nude.js) will probably get you 60-80% accuracy on a generous sample set; for anything more accurate than that, say 90-95%, you are going to need a specialized computer vision system with an evolving model that is revised over time. For the latter you might want to check out http://clarifai.com or https://scanii.com (which I work on).
Microsoft Azure has a very cool API called Computer Vision, which you can use for free (either through the UI or programmatically) and has tons of documentation, including for PHP.
It has some amazingly accurate (and sometimes humorous) results.
Outside of detecting adult and "racy" material, it will read text, guess your age, identify primary colours, etc etc.
You can try it out at azure.microsoft.com.
Sample output from a "racy" image:
FEATURE NAME: VALUE:
Description { "tags": [ "person", "man", "young", "woman", "holding",
"surfing", "board", "hair", "laying", "boy", "standing",
"water", "cutting", "white", "beach", "people", "bed" ],
"captions": [ { "text": "a man and a woman taking a selfie",
"confidence": 0.133149087 } ] }
Tags [ { "name": "person", "confidence": 0.9997446 },
{ "name": "man", "confidence": 0.9587285 },
{ "name": "wall", "confidence": 0.9546831 },
{ "name": "swimsuit", "confidence": 0.499717563 } ]
Image format "Jpeg"
Image dimensions 1328 x 2000
Clip art type 0
Line drawing type 0
Black and white false
Adult content true
Adult score 0.9845981
Racy true
Racy score 0.964191854
Categories [ { "name": "people_baby", "score": 0.4921875 } ]
Faces [ { "age": 37, "gender": "Female",
"faceRectangle": { "top": 317, "left": 1554,
"width": 232, "height": 232 } } ]
Dominant color background "Brown"
Dominant color foreground "Black"
Accent Color #0D8CBE
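Calling the service from PHP is a plain REST request. The sketch below follows the Computer Vision Analyze REST endpoint with the Adult visual feature; treat the exact API version segment and threshold as assumptions (check the current Azure docs), and the resource URL and key are placeholders, so the live call only runs when a real key is set:

```php
<?php
// Decide from an Analyze response whether an image should be rejected.
// Field names (adult/adultScore/racyScore) follow the Adult feature output
// shown above; the 0.5 threshold is an arbitrary choice for illustration.
function isExplicit(array $analysis, float $threshold = 0.5): bool {
    $adult = $analysis['adult'] ?? [];
    return ($adult['adultScore'] ?? 0) >= $threshold
        || ($adult['racyScore'] ?? 0) >= $threshold;
}

$endpoint = 'https://YOUR-RESOURCE.cognitiveservices.azure.com'; // placeholder
$key      = 'YOUR-KEY';                                          // placeholder

if ($key !== 'YOUR-KEY') {
    $ch = curl_init($endpoint . '/vision/v3.2/analyze?visualFeatures=Adult');
    curl_setopt_array($ch, [
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => [
            'Ocp-Apim-Subscription-Key: ' . $key,
            'Content-Type: application/json',
        ],
        CURLOPT_POSTFIELDS => json_encode(['url' => 'https://example.com/image.jpg']),
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $analysis = json_decode(curl_exec($ch), true);
    if (is_array($analysis) && isExplicit($analysis)) {
        // reject the upload
    }
}
```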