Datascraping With PHP - php

I am trying to take advantage of DOMDocument to scrape a table from another website. I am on shared hosting.
Here is what the html looks like:
<tbody>
<tr class="odd">
<td class="nightclub">Elleven</td>
<td class="city">Downtown Miami</td>
</tr>
<tr class="even">
<td class="night club">Story</td>
<td class="city">South Beach</td>
</tr>
</tbody>
I tried doing:
<?php
$domDoc = new \DOMDocument();
$url = "http://example.com/";
$html = file_get_contents($url);
$domDoc->loadHtml($html);
$domDoc->preserveWhiteSpace = false;
$tables = $domDoc->getElementsByTagName('tbody');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$columns = $row->getElementsByTagName('td');
print $columns->item(0)->nodeValue."\n";
print $columns->item(1)->nodeValue."\n";
print $columns->item(2)->nodeValue;
}
When I do this I get no result. I think the server is blocking my request.

Try Simple HTML DOM (simplehtmldom):
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all tr
foreach($html->find('tr') as $element)
echo $element->innertext . '<br>';
It's a good library for parsing HTML.

What I did was use an open-source PHP package called Guzzle. It will even let you crawl the site you are scraping.
If you are on shared hosting, download Guzzle and upload it to your server:
github.com/guzzle/guzzle/releases
<?php
require 'vendor/autoload.php';
$client = new GuzzleHttp\Client();
$domDoc = new DOMDocument();
$url = 'http://example.com';
$res = $client->request('GET', $url, [
'auth' => ['user', 'pass']
]);
$html = (string)$res->getBody();
// The @ in front of $domDoc suppresses any warnings
$domHtml = @$domDoc->loadHTML($html);
//discard white space
$domDoc->preserveWhiteSpace = false;
//the table by its tag name
$tables = $domDoc->getElementsByTagName('tbody');
//get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
foreach ($rows as $row)
{
// get each column by tag name
$columns = $row->getElementsByTagName('td');
// echo the values
echo $columns->item(0)->nodeValue.'<br />';
echo $columns->item(1)->nodeValue.'<br />';
echo $columns->item(2)->nodeValue;
}
?>
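The row loop above assumes every row has three cells, while the sample table in the question has two. A minimal sketch of the same parsing step that iterates over whatever cells exist and collects parser warnings with libxml_use_internal_errors() instead of the @ operator follows. The inline markup is copied from the question, and the $result variable is only an illustrative name; a real page would come from the HTTP response.

```php
<?php
// Sample markup copied from the question; a real page would come from an HTTP response.
$html = <<<HTML
<table>
<tbody>
<tr class="odd"><td class="nightclub">Elleven</td><td class="city">Downtown Miami</td></tr>
<tr class="even"><td class="nightclub">Story</td><td class="city">South Beach</td></tr>
</tbody>
</table>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$domDoc = new DOMDocument();
$domDoc->loadHTML($html);
libxml_clear_errors();

$result = [];
$tbody = $domDoc->getElementsByTagName('tbody')->item(0);
if ($tbody !== null) {
    foreach ($tbody->getElementsByTagName('tr') as $row) {
        $cells = [];
        // Iterate over the cells that exist rather than assuming a fixed count.
        foreach ($row->getElementsByTagName('td') as $cell) {
            $cells[] = trim($cell->nodeValue);
        }
        $result[] = $cells;
    }
}

print_r($result);
```

Checking item(0) for null before use also avoids an error when the page contains no tbody at all.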

If you don't mind using a library, the simplest solution is Simple HTML DOM:
$html = file_get_html("WWW.YOURDOMAIN.COM");
$data = array();
foreach($html->find("table tr") as $tr){
$row = array();
foreach($tr->find("td") as $td){
$row[] = $td->plaintext;
}
$data[] = $row;
}

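Your code is fine; just remove the leading backslash (it is a parse error on PHP versions before 5.3, which have no namespace support). Instead of
$domDoc = new \DOMDocument();
try
$domDoc = new DOMDocument();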
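For what it's worth, on PHP 5.3 and later both spellings should resolve to the same built-in class in a file that has no namespace declaration, so the backslash is unlikely to be the cause there. A minimal check, assuming the dom extension is loaded:

```php
<?php
// In the global namespace both spellings refer to the same built-in class.
$a = new \DOMDocument();
$b = new DOMDocument();

var_dump(get_class($a) === get_class($b));
```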

Related

How to select text from HTML table using PHP DOM query?

How can I get text from HTML table cells using PHP DOM query?
HTML table is:
<table>
<tr>
<th>Job Location:</th>
<td>Kabul
</td>
</tr>
<tr>
<th>Nationality:</th>
<td>Afghan</td>
</tr>
<tr>
<th>Category:</th>
<td>Program</td>
</tr>
</table>
I have the following query, but it doesn't work:
$xmlPageDom = new DomDocument();
@$xmlPageDom->loadHTML($html);
$xmlPageXPath = new DOMXPath($xmlPageDom);
$value = $xmlPageXPath->query('//table td /text()');
See "Get a complete table with PHP DOMDocument and print it".
The answer goes like this:
$html = "<table ID='myid'><tr><td>1</td><td>2</td></tr><tr><td>4</td><td>5</td></tr><tr><td>7</td><td>8</td></tr></table>";
$xml = new DOMDocument();
$xml->validateOnParse = true;
$xml->loadHTML($html);
$xpath = new DOMXPath($xml);
$table = $xpath->query("//*[@id='myid']")->item(0);
$rows = $table->getElementsByTagName("tr");
foreach ($rows as $row) {
$cells = $row -> getElementsByTagName('td');
foreach ($cells as $cell) {
print $cell->nodeValue;
}
}
EDIT: Use this instead
$table = $xpath->query("//table")->item(0);
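The query in the question, '//table td /text()', is not valid XPath because location steps must be separated by / or //; something like '//table//td/text()' would be the valid form. As a sketch of how the label and value cells from the question's table could be paired up, assuming each row holds one th and one td (the $details array is only an illustrative name):

```php
<?php
// Table markup taken from the question.
$html = "<table><tr><th>Job Location:</th><td>Kabul\n</td></tr>"
      . "<tr><th>Nationality:</th><td>Afghan</td></tr>"
      . "<tr><th>Category:</th><td>Program</td></tr></table>";

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$details = [];

// Pair each header cell with the data cell in the same row.
foreach ($xpath->query('//table//tr') as $row) {
    $label = $xpath->query('th', $row)->item(0);
    $value = $xpath->query('td', $row)->item(0);
    if ($label !== null && $value !== null) {
        // Strip surrounding whitespace and the trailing colon from the label.
        $key = rtrim(trim($label->textContent), ':');
        $details[$key] = trim($value->textContent);
    }
}

print_r($details);
```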

Getting data from HTML using DOMDocument

I'm trying to get data from HTML using DOM. I can get some data, but can't figure out how to get the rest. Here is an image highlighting the data I want.
http://i.imgur.com/Es51s5s.png
Here is the code itself:
http://pastebin.com/Re8qEivv
And here is my PHP code:
$html = file_get_contents('result.html');
$dom = new DOMDocument;
$dom->loadHTML($html);
$tr = $dom->getElementsByTagName('tr');
foreach ($tr as $row){
$td = $row->getElementsByTagName('td');
$td1 = $td->item(1);
$td2 = $td->item(2);
foreach ($td1->childNodes as $node){
$title = $node->textContent;
}
foreach ($td2->childNodes as $node){
$type = $node->textContent;
}
}
Figured it out
$html = file_get_contents('result.html');
$dom = new DOMDocument;
$dom->loadHTML($html);
$tr = $dom->getElementsByTagName('tr');
foreach ($tr as $row){
$td = $row->getElementsByTagName('td');
$td1 = $td->item(1);
$td2 = $td->item(2);
$title = $td1->childNodes->item(0)->textContent;
$firstURL = $td1->getElementsByTagName('a')->item(0)->getAttribute('href');
$type = $td2->childNodes->item(0)->textContent;
$imageURL = $td2->getElementsByTagName('img')->item(0)->getAttribute('src');
}
I have used the following class:
http://sourceforge.net/projects/simplehtmldom/
It is very simple and easy to use.
You can use
$html->find('#RosterReport > tbody', 0);
to find a specific table, and
$html->find('tr')
$html->find('td')
to find table rows or cells.
Note that $html is a variable holding the full HTML DOM content.
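If adding a third-party class is not an option, a similar lookup can probably be done with the built-in DOMXPath. The sketch below is an assumption-laden equivalent of the '#RosterReport > tbody' selector, not part of Simple HTML DOM; the sample markup is hypothetical, and only the RosterReport id comes from the answer above.

```php
<?php
// Hypothetical markup; only the id "RosterReport" comes from the answer above.
$html = '<table id="RosterReport"><tbody><tr><td>A</td><td>B</td></tr></tbody></table>'
      . '<table id="Other"><tbody><tr><td>X</td></tr></tbody></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Rough XPath counterpart of the CSS selector "#RosterReport > tbody".
$tbody = $xpath->query("//*[@id='RosterReport']/tbody")->item(0);

$cells = [];
if ($tbody !== null) {
    // Query relative to the matched tbody so other tables are ignored.
    foreach ($xpath->query('tr/td', $tbody) as $td) {
        $cells[] = $td->nodeValue;
    }
}

print_r($cells);
```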

Parsing html with php and ganon

Please help me change the selector in my code.
I am trying to get the seller name from the page http://www.plati.ru/asp/seller.asp?id_s=119777
It should be amedia, but I can't get it.
This is my code:
$result = curl_exec($ch);
curl_close($ch);
$html = str_get_dom($result );
foreach ($html('table tr td tr td') as $element) {
$seller_name = $element->getPlainText();
}
You can try the following, and let me know if you still have any difficulties:
include "ganon.php";
$shopUrl = "http://www.plati.ru/asp/seller.asp?id_s=119777";
$html = file_get_dom($shopUrl);
echo $html('table',9)->getPlainText();
You can use DOMDocument like this to retrieve the td value:
<?php
header('Content-Type: text/html; charset=utf-8');
$DOM = new DOMDocument;
@$DOM->loadHTMLFile('http://www.plati.ru/asp/seller.asp?id_s=119777');
$tables = $DOM->getElementsByTagName('table');//->item(10);
$table = $tables->item(9);
$cells = $table->getElementsByTagName('td');
$cell = $cells->item(0);
echo $cell->textContent;
?>
Then split $cell->textContent on spaces.
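The splitting step might look like the sketch below. The $text value is a hypothetical stand-in, since the real cell content is not shown in the thread; preg_split() on runs of whitespace is used rather than explode(' ') so that repeated spaces and line breaks do not produce empty entries.

```php
<?php
// Hypothetical cell text; the real page content is not shown in the thread.
$text = "  amedia   seller \n rating ";

// Split on any run of whitespace and drop empty pieces.
$words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

echo $words[0];
```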

Preserving <br> tags when parsing HTML text content

I have a little issue.
I want to parse a simple HTML document in PHP.
Here is the HTML:
<html>
<body>
<table>
<tr>
<td>Colombo <br> Coucou</td>
<td>30</td>
<td>Sunny</td>
</tr>
<tr>
<td>Hambantota</td>
<td>33</td>
<td>Sunny</td>
</tr>
</table>
</body>
</html>
And this is my PHP code :
$dom = new DOMDocument();
$html = $dom->loadHTMLFile("test.html");
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cols = $row->getElementsByTagName('td');
echo $cols->item(0)->nodeValue.'<br />';
echo $cols->item(1)->nodeValue.'<br />';
echo $cols->item(2)->nodeValue;
}
As you can see, the first cell contains a <br> tag that I need, but my PHP code strips it out.
Can anybody explain how I can keep it?
I would recommend capturing the values of the table cells with XPath:
$values = array();
$xpath = new DOMXPath($dom);
foreach($xpath->query('//tr') as $row) {
$row_values = array();
foreach($xpath->query('td', $row) as $cell) {
$row_values[] = innerHTML($cell);
}
$values[] = $row_values;
}
I've also had the same problem with <br> tags being stripped from fetched content. nodeValue returns only the text inside a node, and a <br> element contains no text, so it simply disappears; it is not replaced with a newline character (\n).
So I wrote my own innerHTML function, which has proved invaluable in many projects. Here it is:
function innerHTML(DOMElement $element, $trim = true, $decode = true) {
$innerHTML = '';
foreach ($element->childNodes as $node) {
$temp_container = new DOMDocument();
$temp_container->appendChild($temp_container->importNode($node, true));
$innerHTML .= ($trim ? trim($temp_container->saveHTML()) : $temp_container->saveHTML());
}
return ($decode ? html_entity_decode($innerHTML) : $innerHTML);
}
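As an alternative sketch, DOMDocument::saveHTML() accepts a node argument since PHP 5.3.6, so each child can be serialized directly without a temporary document. The innerHtmlOf name below is hypothetical and this is not the helper from the answer above; unlike that helper it does not decode entities.

```php
<?php
// Hypothetical helper: serialize every child of $element with saveHTML($node).
function innerHtmlOf(DOMNode $element): string
{
    $out = '';
    foreach ($element->childNodes as $child) {
        $out .= $element->ownerDocument->saveHTML($child);
    }
    return trim($out);
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// First row of the table from the question.
$dom->loadHTML('<table><tr><td>Colombo <br> Coucou</td><td>30</td><td>Sunny</td></tr></table>');
libxml_clear_errors();

$firstCell = $dom->getElementsByTagName('td')->item(0);
$cellHtml = innerHtmlOf($firstCell);

// The serialized cell keeps the <br> element that nodeValue drops.
echo $cellHtml;
```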

How to parse the attribute value of a <a> tag in PHP

I am trying to parse an HTML page to build a database of universities and colleges in the US. The code I wrote fetches the names of the universities, but I am unable to fetch their respective URLs.
public function fetch_universities()
{
$url = "http://www.utexas.edu/world/univ/alpha/";
$dom = new DOMDocument();
$html = $dom->loadHTMLFile($url);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$tr = $tables->item(1)->getElementsByTagName('tr');
$td = $tr->item(7)->getElementsByTagName('td');
$rows = $td->item(0)->getElementsByTagName('li');
$count = 0;
foreach ($rows as $row)
{
$count++;
$cols = $row->getElementsByTagName('a');
echo "$count:".$cols->item(0)->nodeValue. "\n";
}
}
This is the code I currently have.
Please tell me how to fetch the attribute values as well.
Thank you
If you have a reference to an element, you just have to use getAttribute(), so probably:
echo "$count:".$cols->item(0)->getAttribute('href') . "\n";
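A self-contained sketch of reading both the link text and the href follows. The list markup and the $universities array are hypothetical stand-ins for the real page, and the null check covers list items that contain no link.

```php
<?php
// Hypothetical list markup standing in for the university page.
$html = '<ul><li><a href="http://example.edu/a">Alpha University</a></li>'
      . '<li><a href="http://example.edu/b">Beta College</a></li>'
      . '<li>No link here</li></ul>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$universities = [];
foreach ($dom->getElementsByTagName('li') as $item) {
    $link = $item->getElementsByTagName('a')->item(0);
    if ($link === null) {
        continue; // skip list items without a link
    }
    $universities[] = [
        'name' => trim($link->nodeValue),
        'url'  => $link->getAttribute('href'),
    ];
}

print_r($universities);
```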
