HTML content extraction using Diffbot - php

Can someone help me? I want to extract HTML data from http://www.quranexplorer.com/Hadith/English/Index.html. I have found a service that does exactly that, http://diffbot.com/dev/docs/; it supports data extraction via a simple API. The problem is that I have a large number of URLs that need to be processed; they are listed in the file below: http://test.deen-ul-islam.org/html/h.js
I need to create a script that follows each URL and then, using the API, generates the JSON form of the HTML data (the API allows batch requests; check the website docs).
Please note Diffbot only allows 10,000 free requests per month, so I need a way to save progress and be able to pick up where I left off.
Here is an example I created using PHP:
$token = "dfoidjhku";// example token
$url = "http://www.quranexplorer.com/Hadith/English/Hadith/bukhari/001.001.006.html";
$geturl="http://www.diffbot.com/api/article?tags=1&token=".$token."&url=".$url;
$json = file_get_contents($geturl);
$data = json_decode($json, TRUE);
echo $article_title=$data['title'];
echo $article_author=$data['author'];
echo $article_date=$data['date'];
echo nl2br($article_text=$data['text']);
$article_tags=$data['tags'];
foreach($article_tags as $result) {
echo $result, '<br>';
}
I don't mind if the tool is in JavaScript or PHP; I just need a way to get the HTML data in JSON format.

John from Diffbot here. Note: not a developer, but know enough to write hacky code to do simple things.
You have a list of links -- it should be straightforward to iterate through those, making a call to us for each.
Here's a Python script that does just that: https://gist.github.com/johndavi/5545375
I used a quick search regex in Sublime Text to pull out the links from the JS file.
To shorten the run, just cut out some of the links before running it. It will take a while, as I'm not using the Batch API.
If you need to improve or change this, best seek out a stronger developer directly. Diffbot is a dev-friendly tool.
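For the resume requirement in the question, a minimal PHP sketch might look like the following. The file names (urls.txt, done.txt, an articles/ directory) and the idea of recording finished URLs in a plain text file are assumptions for illustration, not part of the Diffbot API:
<?php
$token = "dfoidjhku"; // example token
$urls = file("urls.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES); // links pulled out of h.js beforehand
$done = file_exists("done.txt") ? file("done.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) : array();

foreach ($urls as $url) {
    if (in_array($url, $done)) {
        continue; // already processed in an earlier run
    }
    $apiUrl = "http://www.diffbot.com/api/article?tags=1&token=" . $token . "&url=" . urlencode($url);
    $json = file_get_contents($apiUrl);
    if ($json === false) {
        break; // stop on failure (e.g. when the monthly quota runs out) and resume later
    }
    file_put_contents("articles/" . md5($url) . ".json", $json); // keep the raw JSON (articles/ must exist)
    file_put_contents("done.txt", $url . "\n", FILE_APPEND); // record progress so the next run skips this URL
}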

Related

I want to send requests from a UE4 c++ game to my php script, so that it interacts with a mysql database

I've been searching the internet for around 3 days now and I'm stuck on this.
I have a MySQL database and a PHP script, as well as a game made in UE4.
UE4 uses C++.
So now I want to send requests from the C++ game to the PHP script, which should interact with the database.
For example, create an account or log in. I also want to pass the MySQL query result from the PHP script to my C++ class.
I tried using HttpRequest, but I can't get data from PHP to C++ with that.
Maybe you can, but I don't understand how.
What I have accomplished so far is sending a POST request from the game to the PHP script and passing variables, so the script can use them to perform the MySQL query.
But how can I pass data from the PHP file back to C++? The response I get is always the whole page (head and body), and I don't know where I could put the query result so the C++ code can read it.
I'm a complete beginner here, so go easy on me. I've read so many different posts and blogs that my brain hurts. I hope someone can tell me how to do this easily, or at least give me a hint on what I have to google and what I could use. I don't need a full tutorial; just the name of a library better than Http.h (if a simple HttpRequest can't manage this) would be enough.
eXi
The PHP script should return an HTTP response reduced to a bare minimum. It doesn't even need to be an HTML document:
<?php
// file: api.php
$param = $_POST['myparam'];
$foo = bar($param); // bar() stands in for your own processing; $foo contains e.g. "1,ab,C"
echo $foo; // if you opened http://myhost.com/api.php in a browser
// all you would see is "1,ab,C"
// (which is not a valid HTML document, but who cares)
?>
Then parse this HTTP response (a plain string, that is) from your game. You can use your own data format, or use a well-known format of your choice (XML or JSON are good candidates).
The JSON object in Unreal is pretty good, so I would recommend outputting JSON from your PHP script. JSON is a pretty natural workflow in PHP.
<?php
$obj['userid'] = 5476;
$obj['foo'] = 'bar';
echo json_encode($obj);
?>
That will echo out
{"userid":5476,"foo":"bar"}
If that's all you output in your script then it's pretty straightforward to treat that as a string and populate an unreal json object with it.
FString TheStuffIGotFromTheServer;
TSharedPtr<FJsonObject> ParsedJson;
TSharedRef<TJsonReader<TCHAR>> JsonReader = TJsonReaderFactory<TCHAR>::Create(TheStuffIGotFromTheServer);
if (FJsonSerializer::Deserialize(JsonReader, ParsedJson))
{
FString foo = ParsedJson->GetStringField("foo");
double UserId = ParsedJson->GetNumberField("userid");
}
Check out the Unreal JSON docs to get a feel for what you can do with it.
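On the PHP side, a sketch that ties the POSTed parameters, the MySQL query, and the JSON output together might look like this. The database credentials, table, and column names are made up for illustration, and the password handling is deliberately simplified:
<?php
// file: login.php - hypothetical endpoint the game calls via POST
$pdo = new PDO('mysql:host=localhost;dbname=game', 'dbuser', 'dbpass');

$stmt = $pdo->prepare('SELECT id, name FROM players WHERE name = ? AND password_hash = ?');
$stmt->execute(array($_POST['name'], hash('sha256', $_POST['password'])));
$player = $stmt->fetch(PDO::FETCH_ASSOC);

header('Content-Type: application/json');
if ($player) {
    echo json_encode(array('success' => true, 'userid' => (int)$player['id'], 'name' => $player['name']));
} else {
    echo json_encode(array('success' => false));
}
The response body is then nothing but the JSON string, which the game can deserialize as shown above.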

Get pixel coordinates of HTML/DOM elements using PHP

I am working on a web crawler/site analyzer in PHP. What I need to do is to extract some tags from an HTML file and compute some attributes (such as image size, for example). I can easily do this using a DOM parser, but I would also need to find the pixel coordinates and size of an HTML/DOM tree element (let's say I have a div and I need to know which area it covers and at which coordinate it starts). I can define a standard screen resolution, that is not a problem for me, but I need to retrieve the pixel coordinates automatically, using a server-side PHP script (or calling some Java app from the console or something similar, if needed).
From what I understand, I need a headless browser in PHP that would simulate/render a web page, from which I can retrieve the pixel coordinates I need. Would you recommend an open-source solution for that? Some code snippets would also be useful, so I don't install a solution and then notice it does not provide pixel coordinates.
PS: I see the people who answered missed the point of the question, which means I did not explain well that I need this solution to work COMPLETELY server-side. Say I use a crawler and it feeds HTML pages to my script. I could launch it from a browser, but also from the console (like 'php myScript.php').
Maybe you can set the coordinates as some kind of metadata on your tag using JavaScript:
$("element").attr("data-coordinates", $("element").offset().top + "," + $("element").offset().left);
Then you have to request the page with PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
#$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('element');
foreach ($tags as $tag) {
echo $tag->getAttribute('data-coordinates'); // this will print the coordinates of each tag
}
A headless browser is overkill for what you're trying to achieve. Just use cookies to store whatever you want.
So any time you get some piece of information in JavaScript, such as an X,Y coordinate, scroll position, etc., simply send it to a PHP script that makes a cookie out of it with some unique string index.
Eventually, you'll have a large array of cookie data that will be directly available to any PHP or javascript file, and you can do anything you'd like with it at that point.
For example, if you wanted to just store stuff in sessions, you could do:
jquery:
// save whatever you want from javascript
// note: probably better to POST, since we're not getting anything really, just showing quick example
$.get('save-attr.php?attr=xy_coord&value=300,550');
PHP:
// this will be the save-attr.php file
session_start();
$_SESSION[$_GET['attr']] = $_GET['value'];
// now any other script can get this value like so:
$coordinates = $_SESSION['xy_coord'];
// where $coordinates would now equal "300,550"
Simply continue this pattern for whatever you need access to in PHP.

Parse Google Images API Json PHP

Hey, well I'm trying to use the Google Images API with PHP, and I'm really not sure what to do.
This is basically what I have right now:
$jsonurl = "https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=test";
$json = file_get_contents($jsonurl,0,null,null);
$json_output = json_decode($json);
Where would I go from there to retrieve the first image url?
With a minor change to the last line of your code sample, the following will output the url of the first image in the result set.
<?php
$jsrc = "https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=test";
$json = file_get_contents($jsrc);
$jset = json_decode($json, true);
echo $jset["responseData"]["results"][0]["url"];
?>
For security reasons, most server configurations won't let you use file_get_contents on a remote file (different domain name). It would potentially allow a hacker to load code from anywhere on the Internet to your site, then execute it.
Even if your server configuration does allow for it, then I wouldn't recommend using it for this purpose. The standard tool for retrieving remote HTTP data is cURL, and there are plenty of good tutorials out there doing exactly what you should do in this case.
So, let's say you've successfully used cURL to retrieve the JSON array.
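A minimal cURL fetch for that step might look like this rough sketch:
$ch = curl_init("https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=test");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string instead of printing it
$json = curl_exec($ch);
curl_close($ch);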
$json_output = json_decode($json, true); // Now the JSON is an associative array
foreach ($json_output['responseData']['results'] as $result)
{
echo $result['url'] . '<br />';
}
Of course, you don't have to echo the URL there; you can do whatever you need to with the value.
I have to say, this is 10 shades of awesome.. But I come with bad news (don't shoot the messenger..)
Important: The Google Image Search API has been officially deprecated as of May 26, 2011. It will continue to work as per our deprecation policy, but the number of requests you may make per day may be limited.
That is, as they say, lame.
I feel as if Google might have hired one too many laid-off-from-IBM types... as they seem to be killing off all their "cool" APIs.
They launch services haphazardly, promising this and that and the other thing... but then some middle-manager gets screamed at after realizing (ta-da!) that XYZ project doesn't generate income (like image results without ads, lol) and then... they axe it..
Lesson: don't get married (i.e. build your software or service) to any Google API you can't replace on the fly at a moment's notice... Now, I'm no LTS junkie - I'm just bitter because I'd much rather get my Google results via XML or JSON than the icky HTML soup they throw at you...
One question, @Marcel... How can I get an array, or at least multiple JSON result responses, using that same "formula"? I thought maybe the 1 meant "result 1", but alas, no... Is there a "trick" to generate a content stream à la a Picasa XML feed?

How can I take a snapshot of a web page's DOM structure?

I need to compare a web page's DOM structure at various points in time. What are the ways to retrieve and snapshot it?
I need the DOM on server-side for processing.
I basically need to track structural changes to a web page, such as removing a div tag or inserting a p tag. Changing the data (innerHTML) inside those tags should not be seen as a difference.
$html_page = file_get_contents("http://awesomesite.com");
$html_dom = new DOMDocument();
$html_dom->loadHTML($html_page);
That uses PHP DOM. Very simple and actually a bit fun to use (see the DOMDocument reference in the PHP manual).
EDIT: After clarification, a better answer lies here.
Perform the following steps on server-side:
Retrieve a snapshot of the webpage via HTTP GET
Save consecutive snapshots of a page with different names for later comparison
Compare the files with an HTML-aware diff tool (see HtmlDiff tool listing page on ESW wiki).
As a proof-of-concept example with Linux shell, you can perform this comparison as follows:
wget --output-document=snapshot1.html http://example.com/
wget --output-document=snapshot2.html http://example.com/
diff snapshot1.html snapshot2.html
You can of course wrap up these commands into a server-side program or a script.
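Wrapped in PHP, the same proof of concept might look like this (purely a sketch; the file names and the wait are arbitrary):
<?php
// Fetch two snapshots of the page, then let the system diff tool compare them.
file_put_contents('snapshot1.html', file_get_contents('http://example.com/'));
sleep(60); // or run the second fetch at a later time
file_put_contents('snapshot2.html', file_get_contents('http://example.com/'));
echo shell_exec('diff snapshot1.html snapshot2.html');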
For PHP, I would suggest you take a look at daisydiff-php. It readily provides a PHP class that enables you to easily create an HTML-aware diff tool. Example:
<?
require_once('HTMLDiff.php');
$file1 = file_get_contents('snapshot1.html');
$file2 = file_get_contents('snapshot2.html');
$differ = new HTMLDiffer();
$differ->htmlDiffer( $file1, $file2 );
?>
Note that file_get_contents can also retrieve data from a given URL.
Note that DaisyDiff itself is a very fine tool for visualising structural changes as well.
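If you only care about the tag structure and want text (innerHTML) changes ignored, as the question asks, one rough approach is to strip the text nodes before comparing. A sketch using PHP DOM (the function name and the snapshot file names are just placeholders):
<?php
// Reduce a page to its element structure only (all text nodes removed),
// so two snapshots can be compared purely on structure.
function structure_only($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $textNodes = array();
    foreach ($xpath->query('//text()') as $node) {
        $textNodes[] = $node; // collect first, then remove, to avoid mutating the list while iterating
    }
    foreach ($textNodes as $node) {
        $node->parentNode->removeChild($node);
    }
    return $dom->saveHTML();
}

$same = structure_only(file_get_contents('snapshot1.html'))
     === structure_only(file_get_contents('snapshot2.html'));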
If you use Firefox, Firebug lets you view the DOM structure of any web page.

Basic web-crawling question: How to create a list of all pages on a website using PHP?

I would like to create a crawler using PHP that would give me a list of all the pages on a specific domain (starting from the homepage: www.example.com).
How can I do this in PHP?
I don't know how to recursively find all the pages on a website starting from a specific page and excluding external links.
For the general approach, check out the answers to these questions:
How to write a crawler?
How to best develop web crawlers
Is there a way to use PHP to crawl links?
In PHP, you should be able to simply fetch a remote URL with file_get_contents(). You could perform a naive parse of the HTML by using a regular expression with preg_match() to find <a href=""> tags and parse the URL out of them (See this question for some typical approaches).
Once you've extracted the raw href attribute, you could use parse_url() to break it into components and figure out whether it's a URL you want to fetch - remember also that the URLs may be relative to the page you've fetched.
Though fast, a regex isn't the best way of parsing HTML - you could also try the DOM classes to parse the HTML you fetch, for example:
$dom = new DOMDocument();
$dom->loadHTML($content);
$anchors = $dom->getElementsByTagName('a');
if ( $anchors->length > 0 ) {
foreach ( $anchors as $anchor ) {
if ( $anchor->hasAttribute('href') ) {
$url = $anchor->getAttribute('href');
//now figure out whether to process this
//URL and add it to a list of URLs to be fetched (see the sketch after this block)
}
}
}
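To fill in that last comment, a rough filter using parse_url() might look like this; $visited and $queue are assumed arrays and the host name is just an example:
$base_host = 'www.example.com'; // the site being crawled
$parts = parse_url($url);
// A relative URL has no host component, so it belongs to the site we started on.
$is_internal = !isset($parts['host']) || $parts['host'] === $base_host;
if ($is_internal && !in_array($url, $visited)) {
    $visited[] = $url; // remember it so we don't queue it twice
    $queue[] = $url;   // fetch this one later
}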
Finally, rather than write it yourself, see also this question for other resources you could use.
Is there a good web crawler library available for PHP or Ruby?
Overview
Here are some notes on the basics of the crawler.
It is a console app - it doesn't need a rich interface, so I figured a console application would do. The output is written as an HTML file and the input (what site to crawl) comes from the app.config. Making a Windows app out of this seemed like overkill.
The crawler is designed to only crawl the site it originally targets. It would be easy to change that if you want to crawl more than just a single site, but single-site crawling is the goal of this little application.
Originally the crawler was just written to find bad links. Just for fun I also had it collect information on page and viewstate sizes. It will also list all non-HTML files and external URLs, just in case you care to see them.
The results are shown in a rather minimalistic html report. This report is automatically opened in Internet Explorer when the crawl is finished.
Getting the Text from an HTML Page
The first crucial piece of building a crawler is the mechanism for going out and fetching the HTML off the web (or your local machine, if you have the site running locally). Like so much else, .NET has classes for doing this very thing built into the framework.
private static string GetWebText(string url)
{
HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
request.UserAgent = "A .NET Web Crawler";
WebResponse response = request.GetResponse();
Stream stream = response.GetResponseStream();
StreamReader reader = new StreamReader(stream);
string htmlText = reader.ReadToEnd();
return htmlText;
}
The HttpWebRequest class can be used to request any page from the internet. The response (retrieved through a call to GetResponse()) holds the data you want. Get the response stream, throw it in a StreamReader, and read the text to get your html.
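Since the question asks about PHP, a rough PHP counterpart of that fetch step (using a stream context just to set the user agent) might be:
<?php
function get_web_text($url) {
    // Set a user agent, then fetch the page body as a string.
    $context = stream_context_create(array(
        'http' => array('user_agent' => 'A PHP Web Crawler'),
    ));
    return file_get_contents($url, false, $context);
}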
For reference: http://www.juicer.headrun.com
