Extract HTML from a site using PHP [duplicate] - php

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 8 years ago.
This is the site to which I am referring.
I have searched through Stack Overflow and tried various suggested PHP methods like file_get_contents() and readfile(), but they cannot retrieve the table values from the site.
I tried viewing the page source and could not locate the table values either. I also looked for an iframe src, but to no avail.
Is there any method I can use to retrieve these values from the site?
Please advise.

The table's HTML seems to be generated on the client side (in your browser) with JavaScript, so it won't show up in the server's response the way you see it in the browser (you can verify this by disabling JavaScript and reloading the site). You can either:
Switch technology and use some kind of remote-controlled (headless) browser like PhantomJS, or
Try to use their raw data directly. Open your browser's developer tools (usually F12) and check which URLs are fetched; you might need to analyze the site's JavaScript code to make sense of them (a sketch of this approach follows below).
In both cases, check with the site's owners whether they are OK with this kind of use (read their data use policy if they have one, or just e-mail them); most site owners are not exactly happy about this kind of crawling.
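If the raw data turns out to come from a JSON endpoint, a minimal sketch of fetching it from PHP could look like this (the URL and the 'value' field are made-up placeholders; use whatever you actually see in the network tab):
<?php
// Hypothetical endpoint spotted in the browser's network tab -- replace it
// with the URL the site's JavaScript really requests.
$url = 'http://www.example.com/api/table-data?format=json';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$json = curl_exec($ch);
curl_close($ch);

// Decode the JSON into an array and pull out the values you need.
$data = json_decode($json, true);
if (is_array($data)) {
    foreach ($data as $row) {
        echo $row['value'], PHP_EOL; // 'value' is an assumed field name
    }
}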

Use the logic of cURL; please refer to this example:
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>
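Once $output holds the page's HTML, a rough sketch of extracting table cells with PHP's built-in DOM extension (this only helps if the table is actually present in the fetched markup, which it won't be when it is generated by JavaScript, as the other answer notes):
$dom = new DOMDocument();
// Suppress warnings about imperfect real-world HTML.
@$dom->loadHTML($output);

$xpath = new DOMXPath($dom);
// Grab every cell of the first table; adjust the XPath query to the real markup.
foreach ($xpath->query('//table[1]//td') as $cell) {
    echo trim($cell->textContent), PHP_EOL;
}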

Related

Using the Wikipedia API with REST clients [duplicate]

This question already has answers here:
How to get results from the Wikipedia API with PHP?
(4 answers)
Closed 9 years ago.
I'm trying to get Wikipedia pages (from a particular category) using MediaWiki. For this I'm following this tutorial, Listing 3: Listing pages within a category. My question is: how can I get Wikipedia pages without using Zend Framework? Is there any PHP-based REST client that doesn't need to be installed? Zend requires installing their package first, plus some configuration... and I don't want to do all that.
After googling and some investigation I found a tool called cURL; using cURL with PHP, one can also build a REST client. I'm really new to implementing REST services, but I have already tried to implement something in PHP:
<?php
header('Content-type: application/xml; charset=utf-8');

function curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$wiki = "http://de.wikipedia.org/w/api.php?action=query&list=allcategories&acprop=size&acprefix=haut&format=xml";
$result = curl($wiki);
var_dump($result);
?>
But I got errors in the result. Could anyone help with this?
UPDATE:
This page contains the following errors:
error on line 1 at column 1: Document is empty
Below is a rendering of the page up to the first error.
Sorry for taking so long to reply, but better late than never...
When I run your code on the command line, the output I get is:
string(120) "Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.
"
So it seems the problem is that you're bumping into the Wikimedia bot User-Agent policy by not telling cURL to send a custom User-Agent header. To fix this, follow the advice given at the bottom of that page and add lines like the following to your script (alongside the other curl_setopt() calls):
$agent = 'ProgramName/1.0 (http://example.com/program; your_email@example.com)';
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
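For clarity, here is the curl() function from the question with that extra curl_setopt() call added (the program name, URL and e-mail address are placeholders to replace with your own):
function curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Identify the script, as required by the Wikimedia User-Agent policy.
    $agent = 'ProgramName/1.0 (http://example.com/program; your_email@example.com)';
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}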
Ps. You probably also don't want to set an application/xml content type unless you're sure that the content actually is valid XML. In particular, the output of var_dump() will not be valid XML, even if the input is.
For testing and development, I'd suggest either running PHP from the command line or using the text/plain content type. Or, if you prefer, use text/html and encode your output with htmlspecialchars().
Ps. Made this a community wiki answer, since I realized that this question has already been asked and answered before.

How to collect HTML source response from a remote server?

From within one of my server pages I need to run a search for a specific item against a database hosted on a remote server that I don't own.
Example of the search type that performs my request: http://www.remoteserver.com/items/search.php?search_size=XXL
The remote server returns a response to me, as the client, displaying a page with several items that match my search criteria.
I don't want to have this page displayed. What I want is to collect into a string (or local file) the full contents of the remote server's HTML response (the code we see when we click 'View Source' in the browser).
If I collect that data (it could easily reach 50,000 bytes), I can then filter out the substrings I am interested in and assemble a new request to the remote server for just one of the specific items in the response.
Is there any way I can get the HTML of the remote server's response with JavaScript or PHP, while avoiding displaying the response in the browser itself?
I hope I have not confused your minds …
Thanks for any help you may provide.
As @mario mentioned, there are several different ways to do it.
Using file_get_contents():
$txt = file_get_contents('http://www.example.com/');
echo $txt;
Using PHP's cURL functions:
$url = 'http://www.mysite.com';
$ch = curl_init($url);
// Tell curl_exec to return the text instead of sending it to STDOUT
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
// Don't include return header in output
curl_setopt($ch, CURLOPT_HEADER, 0);
$txt = curl_exec($ch);
curl_close($ch);
echo $txt;
cURL is probably the most robust option because it gives you more control over the exact request parameters and more possibilities for error handling when things don't go as planned.
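As a rough sketch of the kind of error handling meant here (the URL is just an example):
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false);
$txt = curl_exec($ch);

if ($txt === false) {
    // Transport-level failure: DNS error, timeout, connection refused, ...
    echo 'cURL error: ' . curl_error($ch);
} else {
    // The request went through, but it is still worth checking the HTTP status.
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status !== 200) {
        echo "Unexpected HTTP status: $status";
    } else {
        echo $txt;
    }
}
curl_close($ch);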

Trying to find the best method

I will set up a register page using MSSQL.
The system must work like this:
The user submits data at something.com/register.php.
The data is sent to host-ip-address/regsecond.php, which is where my database will be. (For security reasons, this PHP page won't access the database directly.)
The PHP page at the host will start another PHP page or an EXE file that will reach the database directly and securely.
As my PHP level is not high, I wanted to learn whether I can start PHP scripts that do their job without ever appearing in the user's browser. Here is what I mean:
"I submit some data at x.php, and it starts another PHP script which does the job with the data submitted from x.php, but that second PHP script never appears in the user's browser."
I hope I was clear. In summary: should I use an EXE (which will be harder), or can I start a PHP script without it appearing in the browser? And how, of course?
You can do this using the curl extension. You can find info on it here:
http://php.net/manual/en/book.curl.php
You can do something like the following:
$postdata = array(
    'item1' => 'data'
);

$ch = curl_init("http://host-ip-address/regsecond.php");
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
curl_exec($ch);
curl_close($ch);
This makes a call directly from your first script to your second script without exposing anything to the user. On the far side, the data will come in as regular post data ($_POST).
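On the far side, regsecond.php would read the values from $_POST as usual; a minimal sketch (the field name 'item1' matches the example above, and the checks are only illustrative):
<?php
// regsecond.php -- receives the POST sent by the first script.
if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['item1'])) {
    $item1 = $_POST['item1'];
    // ... validate the value and hand it to the database layer here ...
    echo 'received';
} else {
    http_response_code(400);
    echo 'bad request';
}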
You can't post data through PHP to a different website.
If you would like, you can configure your PHP script to connect to a different server for your MySQL database, although I wouldn't say it's a huge amount safer. For example,
instead of:
mysql_connect('localhost', $username, $password);
try this:
mysql_connect('your-ip:portnumber', $username, $password);
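Note that the mysql_* functions are deprecated and removed in PHP 7; a sketch of the same remote connection using PDO instead (host, port, database name and credentials are placeholders):
$dsn = 'mysql:host=your-ip;port=3306;dbname=yourdb;charset=utf8mb4';
try {
    $pdo = new PDO($dsn, 'username', 'password', array(
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // throw on SQL errors
    ));
} catch (PDOException $e) {
    die('Connection failed: ' . $e->getMessage());
}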
I'm not sure I understand this correctly, but you may:
§1 use a "public" PHP script that invokes a private one:
<?php
//public register script
//now call private
//store data to txt-file or similar..
require('/path/outside/www-data/script_that_processes_further.php');
§2 request a script at another server,
<?php
file_get_contents('http://asdf.aspx?firstname=' . urlencode($theFirstName)); // simplistic
//other options would be curl, xml/soap or whatever.
§1 may be used together with §2; a rough sketch combining both follows below.
regards,
/t
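A rough sketch combining §1 and §2 (the path, URL and field name are placeholders taken from the question and the answer above):
<?php
// public register script (register.php)
$firstName = isset($_POST['firstname']) ? $_POST['firstname'] : '';

// §2: forward the data to the other server, out of the user's sight.
file_get_contents('http://host-ip-address/regsecond.php?firstname=' . urlencode($firstName));

// §1: hand further processing to a script kept outside the web root.
require '/path/outside/www-data/script_that_processes_further.php';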

In any language, can I capture a webpage and save it as an image file? (no install, no ActiveX)

I heard it is possible to capture webpages using PHP (maybe above 6.0) on a Windows server.
I got some sample code and tested it, but none of the code worked correctly.
Do you know a right way to capture a webpage and save it as an image file from a web application?
Please teach me.
You could use the Browsershots API: http://browsershots.org/
With the XML-RPC interface you could use almost any language to access it.
http://api.browsershots.org/xmlrpc/
Though you have asked for a PHP solution, I would like to share yet another solution with Perl. WWW::Mechanize along with LWP::UserAgent and HTML::Parser can help in screen scraping.
Some documents for reference:
Web scraping with WWW::Mechanize
Screen-scraping with WWW::Mechanize
Downloading the HTML of a web page is commonly known as screen scraping. This can be useful if you want a program to extract data from a given page. The easiest way to request HTTP resources is to use a tool called cURL. cURL comes as a standalone Unix tool, but there are libraries to use it in just about every programming language. To capture this page from the Unix command line, type:
curl http://stackoverflow.com/questions/1077970/in-any-languages-can-i-capture-a-webpageno-install-no-activex-if-i-can-plz
In PHP, you can do the same:
<?php
$ch = curl_init() or die('Failed to initialise cURL');
curl_setopt($ch, CURLOPT_URL, "http://stackoverflow.com/questions/1077970/in-any-languages-can-i-capture-a-webpageno-install-no-activex-if-i-can-plz");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data1 = curl_exec($ch) or die(curl_error($ch));
echo "<font color=black face=verdana size=3>" . $data1 . "</font>";
echo curl_error($ch);
curl_close($ch);
?>
Now, before copying an entire website, you should check its robots.txt file to see whether it allows robots to spider the site, and you may want to check whether an API is available that lets you retrieve the data without scraping the HTML.
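As a quick manual check of that robots.txt (a real crawler should parse the rules properly; this just fetches and displays the file):
// Fetch the site's robots.txt and show it so you can read the rules by hand.
$robots = @file_get_contents('http://stackoverflow.com/robots.txt');
if ($robots !== false) {
    echo nl2br(htmlspecialchars($robots));
} else {
    echo 'No robots.txt found (or it could not be fetched).';
}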

Make cURL behave exactly like a form

I have a simple HTML form on my site which sends data to some remote site.
What I want to do is use the data the user enters into the form for statistical purposes.
So instead of sending the data to the remote page, I send it first to my script, which resends it to the remote site.
The thing is, I need it to behave exactly the way the usual form would, taking the user to the remote site and displaying its resources.
When I use this code it kind of works, but not in the way I want it to:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $action);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
curl_close($ch);
The problem is that it displays the response in the same script. For example, if $action is
somesite.com/processform.php and my script's name is myscript.php, it displays the response of "somesite.com/processform.php" inside "myscript.php", so all the relative links are broken.
How do I make it send the user to "somesite.com/processform.php", the same thing that pressing the submit button would do?
Leonti
I think you will have to do this on your end, as translating relative paths is the client's job. It should be simple: Just take the base directory of the request you made
http://otherdomain.com/my/request/path.php
and add it in front of every outgoing link that does not begin with "/" or a protocol ("http://", "ftp://").
Detecting all the outgoing links is hard, but I am 100% sure there are ready-made PHP classes that do that. Check, for example, this article and the getLinks() function in the user comments. I am not 100% sure whether this is what you need, but it certainly goes in the right direction.
Here are a couple of possible solutions, which I post separately so they don't get mixed up with the one I recommend:
1 - keep using cURL, parse the response and add a <base/> tag to it (see the sketch after these options). It should work for pretty much everything on that page.
<base href="http://realsite.com/form_url.php" />
2 - do not alter the submit URL. Submit the form to the real URL, but capture its content using some JavaScript library (YUI does that) and send it to your script via XHR. It's still kind of hacky, though.
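A rough sketch of option 1, injecting the <base> tag into the response fetched with cURL before echoing it (the URL is a placeholder, and the string replacement assumes the page has a <head> element):
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://realsite.com/form_url.php');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // capture the response instead of echoing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
curl_close($ch);

// Naive injection: add a <base> right after <head> so relative links resolve
// against the real site instead of this script.
$base = '<base href="http://realsite.com/form_url.php" />';
echo str_ireplace('<head>', '<head>' . $base, $result);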
There are several ways to do that. Here's one of the easiest: just use a 307 redirect.
header('Location: http://realsite.com/form_url.php', true, 307);
You can do your logging and stuff either before or after header() but if you do it after calling header() you will need to start your script with
ignore_user_abort(true);
Note that browsers are supposed to notify the user that their form is being redirected.
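Putting that together, a minimal sketch of the logging-plus-redirect script (the log path and target URL are placeholders):
<?php
// Keep running even if the browser disconnects right after the redirect.
ignore_user_abort(true);

// 307 preserves the request method and body, so the POST is replayed
// against the real form handler.
header('Location: http://realsite.com/form_url.php', true, 307);

// Do the statistics/logging after the header has been queued.
file_put_contents('/path/to/form-stats.log', json_encode($_POST) . PHP_EOL, FILE_APPEND);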
