PHP Using domdocument to extract data from html - php

I have a table with the following structure. I cannot seem to get the data I want.
<table class="gsborder" cellspacing="0" cellpadding="2" rules="cols" border="1" id="d00">
<tr class="gridItem">
<td>Code</td><td>0adf</td>
</tr><tr class="AltItem">
<td>CompanyName</td><td>Some Company</td>
</tr><tr class="Item">
<td>Owner</td><td>Jim Jim</td>
</tr><tr class="AltItem">
<td>DivisionName</td><td> </td>
</tr><tr class="Item">
<td>AddressLine1</td><td>9314 W. SPRING ST.</td>
</tr>
</table>
This table is of course nested within another table within the page. How can I use DomDocument for example to refer to "Code" and "0adf" as a key value pair? They actually don't need to be in a key value pair but I should be able to call them each separately.
EDIT:
Using PHP Simple HTML, I was able to extract the data I needed using this:
$foo = $html->getElementById("d00")->childNodes(1)->childNodes(1);
The problem with this though is that I am getting the two <td></td> tags with my data. Is there a way to only grab the raw data without the tags?
Also, is this the right way to get my data out of this table?

If you're not dead set on using DOMDocument, try using the PHP Simple HTML DOM Parser. This has the benefit of allowing you to parse HTML which is not valid XML as well as providing a nicer interface to the parsed document.
You could write something like:
$html = str_get_html(...);
foreach($html->find('tr') as $tr)
{
print 'First td: ' . $tr->find('td', 0)->plaintext;
print 'Second td: ' . $tr->find('td', 1)->plaintext;
}
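For readers who do want to stay with DOMDocument, as the question originally asked, the same key/value extraction can be sketched roughly as below. The helper name tableToPairs() and the shortened sample table are assumptions made for illustration; they are not part of the question or of any library.

```php
<?php
// Hypothetical helper: maps the first cell of each row to the second cell.
// Assumes $tableId contains no double quotes (it is pasted into the XPath).
function tableToPairs(string $html, string $tableId): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML often triggers parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $pairs = [];
    foreach ($xpath->query('//table[@id="' . $tableId . '"]//tr') as $tr) {
        $cells = $tr->getElementsByTagName('td');
        if ($cells->length < 2) {
            continue;                   // skip rows without a label/value pair
        }
        // textContent is the text of the cell without any tags
        $pairs[trim($cells->item(0)->textContent)] = trim($cells->item(1)->textContent);
    }
    return $pairs;
}

// Shortened version of the table from the question
$html = '<table id="d00">'
      . '<tr class="gridItem"><td>Code</td><td>0adf</td></tr>'
      . '<tr class="AltItem"><td>CompanyName</td><td>Some Company</td></tr>'
      . '<tr class="Item"><td>Owner</td><td>Jim Jim</td></tr>'
      . '</table>';

$pairs = tableToPairs($html, 'd00');
echo $pairs['Code'], "\n";
echo $pairs['CompanyName'], "\n";
```

Each value can then be read separately by its label, for example $pairs['Owner'].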

Related

Getting DOM elements of html from file_get_contents [duplicate]

I am fetching html from a website with file_get_contents. I have a table (with a class name) inside html, and I want to get the data inside html tags.
This is how I fetch the html data from url:
$url = 'http://example.com';
$content = file_get_contents($url);
The html looks like:
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
</tbody>
</table>
Is there a way to search DOM elements in PHP like we do in jQuery? I want to access the values 1 and 2 (the first td of each row) and the div's value inside the second td.
Something like
a) search the html for table with class name space
b) inside that table, inside tbody, return each tr's 'first td's value' and 'div's value inside second td'
So I get; 1 and Mars, 2 and Earth.
Use the DOM extension, for example. Its DOMXPath class is particularly useful for this kind of task.
You can easily set the listed conditions with an XPath expression like this:
//table[@class="space"]//tr[count(td) = 2]/td
where
- //table[@class="space"] selects all table elements in the document whose class attribute equals the string "space";
- //tr[count(td) = 2] selects all tr elements having exactly two td child elements;
- /td represents the td elements.
Sample implementation:
$html = <<<'HTML'
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
<tr>
<td class="earthia">3</td>
</tr>
</tbody>
</table>
HTML;
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//table[@class="space"]//tr[count(td) = 2]/td');
$i = 0;
foreach ($cells as $td) {
if (++$i % 2) {
$number = $td->nodeValue;
} else {
$planet = trim($td->textContent);
printf("%d: %s\n", $number, $planet);
}
}
Output
1: Mars
2: Earth
The code above should be treated as a sample rather than as something for practical use, because it does not scale well. The logic depends on the XPath expression selecting exactly two cells for each row. In practice, you may want to select the rows, iterate over them, and put the extra conditions into the loop, e.g.:
$rows = $xpath->query('//table[@class="space"]//tr');
foreach ($rows as $tr) {
$cells = $xpath->query('.//td', $tr);
if ($cells->length < 2) {
continue;
}
$number = $cells[0]->nodeValue;
$planet = trim($cells[1]->textContent);
printf("%d: %s\n", $number, $planet);
}
Here DOMXPath::query() is called with an XPath expression relative to the current row ($tr), and the loop then checks whether the returned DOMNodeList contains at least two cells. The rest of the code is trivial.
You can also use the SimpleXML extension, which supports XPath as well, but it is much less flexible than the DOM extension.
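A minimal sketch of the SimpleXML route, assuming the HTML is first parsed with DOMDocument and then handed over with simplexml_import_dom(). The compact sample markup and the $planets array are illustrative only.

```php
<?php
// Same table as in the sample above, written compactly
$html = '<table class="space"><tbody>'
      . '<tr><td class="marsia">1</td><td class="mars"><div>Mars</div></td></tr>'
      . '<tr><td class="earthia">2</td><td class="earth"><div>Earth</div></td></tr>'
      . '<tr><td class="earthia">3</td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);                  // SimpleXML itself cannot parse HTML
$root = simplexml_import_dom($doc);     // SimpleXMLElement wrapping the <html> element

$planets = [];
// Only rows with exactly two cells, as in the XPath expression discussed above
foreach ($root->xpath('//table[@class="space"]//tr[count(td) = 2]') as $tr) {
    $planets[(int) $tr->td[0]] = trim((string) $tr->td[1]->div);
}

print_r($planets);
```

Child elements are reached as properties ($tr->td[1]->div), which is convenient for shallow structures like this one but gets awkward once the markup varies from row to row.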
For huge documents, use a streaming parser such as XMLReader (a pull parser) or the SAX-style XML Parser extension, neither of which loads the whole document into memory.

Extracting text using preg_match

I am trying to extract a piece of text from an HTML page using the PHP function preg_match.
I've successfully loaded the HTML into a variable, but now I am stuck on extracting the right piece of information, probably because I am a bit confused by the syntax of preg_match.
So basically, here is a piece of the HTML I am interested in:
...<tr >
<td >Metuje</td>
<td ><a href="./detail_stanice/307158.html" >Maršov nad Metují</a></td>
<td >A</td>
<td >90</td>
<td >120</td>
<td >150</td>
<td >cm</td>
<td >04.08. 14:20</td>
<td >31</td>
<td >0.53</td>
<td ><img src="./img/ldown.png" width="15" /></td>
</tr>...
What I need is to find this particular row in the table (which contains couple of other rows), so basically I need to search for the name "Maršov nad Metují" in the second cell and then, extract the values of the subsequent cells on that row into a string, in other words in this particular case I would like to have a string with values A, 90, 120, etc. until the end of the row.
On the website there are then other rows with the exact same format just with different values, so I would then use the same syntax to extract values for rows with different names in the second cell.
I have tried it myself, but I was not able to get the right output.
I tried something like the following, but it does not solve the problem. I know I have to somehow take the td cells into account, but unfortunately I wasn't able to get it right in this particular case:
preg_match("/Maršov nad Metují(.*?)\<\/tr/", $html, $results);
Any help is very much appreciated.
Thanks
Try this :
<?php
$info = '<tr ><td >Metuje</td><td ><a href="./detail_stanice/307158.html" >Maršov nad Metují</a></td><td >A</td><td >90</td><td >120</td><td >150</td><td >cm</td><td >04.08. 14:20</td><td >31</td><td >0.53</td><td ><img src="./img/ldown.png" width="15" /></td></tr>';
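// Capture the link target and the link text (U = ungreedy, i = case-insensitive)
preg_match('/<a href="(.*)" >(.*)</Ui',$info,$result);
print_r($result[2]);// Maršov nad Metují
// Collect the contents of every cell in the row (from $info; $html is not defined here)
preg_match_all("/<td.*?>(.+?)<\/td>/is", $info, $matches);
$result = $matches[1];
// Drop the first two cells (river name and station link)
array_shift($result);
array_shift($result);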
print implode(', ', $result);
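The snippet above works on a single row that has already been isolated. To pick the right row out of a whole table by the name in its second cell, a DOM-based sketch along the following lines may be easier to maintain than a regular expression. The function name cellsAfterName() is made up for illustration, the row is wrapped in a table element so the parser sees complete markup, and the code assumes the page is UTF-8.

```php
<?php
// Hypothetical helper: finds the row whose second cell contains $name and
// returns the text of every cell after it.
function cellsAfterName(string $html, string $name): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Assumption: the input is UTF-8. Without a charset hint, loadHTML()
    // treats the bytes as ISO-8859-1 and "Maršov nad Metují" would not match.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = $tr->getElementsByTagName('td');
        if ($cells->length < 3 || trim($cells->item(1)->textContent) !== $name) {
            continue;
        }
        $values = [];
        for ($i = 2; $i < $cells->length; $i++) {
            $values[] = trim($cells->item($i)->textContent);
        }
        return $values;
    }
    return [];   // no row with that name
}

$html = '<table><tr><td>Metuje</td>'
      . '<td><a href="./detail_stanice/307158.html">Maršov nad Metují</a></td>'
      . '<td>A</td><td>90</td><td>120</td><td>150</td><td>cm</td><td>04.08. 14:20</td>'
      . '<td>31</td><td>0.53</td><td><img src="./img/ldown.png" width="15" /></td></tr></table>';

$values = cellsAfterName($html, 'Maršov nad Metují');
// The last cell only holds an image, so its text is empty; drop empty entries
echo implode(', ', array_filter($values, 'strlen')), "\n";
```

Calling the function again with a different station name reuses the same logic for the other rows.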

Scraping using php - preg_match_all

I am trying to get the value of Internet Data Volume Balance; the script should echo 146.30 MB.
I am new to all this and have been going through the tutorials.
How can this be done?
<tr >
<td bgcolor="#F8F8F8"><div align="left"><B><FONT class="tplus_text">Account Status</FONT></B></div></td>
<td bgcolor="#FFFFFF"><div align="left"><FONT class="tplus_text">You exceeded your allowed credit.</FONT></div></td>
</tr>
<tr >
<td bgcolor="#F8F8F8"><div align="left"><B><FONT class="tplus_text">Period Free Time Remaining</FONT></B></div></td>
<td bgcolor="#FFFFFF"><div align="left"><FONT class="tplus_text">0:00:00 hours</FONT></div></td>
</tr>
<tr >
<td bgcolor="#F8F8F8"><div align="left"><B><FONT class="tplus_text">Internet Data Volume Balance</FONT></B></div></td>
<td bgcolor="#FFFFFF"><div align="left"><FONT class="tplus_text" style="text-transform:none;">146.30 MB</FONT></div></td>
</tr>
If you are willing to install phpQuery, or already have it installed, you can use that.
phpQuery::newDocumentFileHTML('htmlpage.html');
// :eq() is zero-based, as in jQuery, so the sixth cell is index 5
echo pq('td:eq(5)')->text();
PHP can interact with the DOM just like JavaScript can. This is vastly superior to picking the markup apart with regular expressions, which most people will tell you is the wrong approach anyway:
Loading from an HTML File
// Start by creating a new document
$doc = new DOMDocument();
// I've loaded the table into an external file, and am loading it into the $doc
$doc->loadHTMLFile( 'htmlpage.html' );
// Since you have six table cells, I'm calling up all of them
$cells = $doc->getElementsByTagName("td");
// I'm grabbing the sixth cell's textContent property
echo $cells->item(5)->textContent;
This code will output "146.30 MB" to the screen.
Loading from a String
If you have the HTML stored in a string, you can load that into your document as well. Only the loading call changes, from loadHTMLFile() to loadHTML():
$str = "<table><tr><td>Foo</td></tr>...</table>";
$doc = new DOMDocument();
$doc->loadHTML( $str );
We would then proceed with the same code as above to select the cells, and show their textContent in the output.
Check out the DOMDocument Class.
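Picking a cell by position breaks as soon as a row is added or removed. A sketch that looks the value up by its label instead might look like the following; valueForLabel() is a hypothetical helper, and the sample rows are wrapped in a table element so the parser gets complete markup.

```php
<?php
// Hypothetical helper: returns the text of the second cell in the row whose
// first cell matches $label, or null if there is no such row.
function valueForLabel(string $html, string $label): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // silence warnings about sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = $tr->getElementsByTagName('td');
        if ($cells->length >= 2 && trim($cells->item(0)->textContent) === $label) {
            return trim($cells->item(1)->textContent);
        }
    }
    return null;
}

$page = '<table>'
      . '<tr><td><div align="left"><B><FONT class="tplus_text">Period Free Time Remaining</FONT></B></div></td>'
      . '<td><div align="left"><FONT class="tplus_text">0:00:00 hours</FONT></div></td></tr>'
      . '<tr><td><div align="left"><B><FONT class="tplus_text">Internet Data Volume Balance</FONT></B></div></td>'
      . '<td><div align="left"><FONT class="tplus_text">146.30 MB</FONT></div></td></tr>'
      . '</table>';

$balance = valueForLabel($page, 'Internet Data Volume Balance');
echo $balance, "\n";
```

Because textContent ignores the nested div, B and FONT elements, the label comparison only has to deal with the visible text.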

strip tags placing a delimiter or store to an array using PHP

I've stripped the tags from a page fetched from a URL like this:
$url='http://abcd.com';
$d=stripslashes(file_get_contents($url));
echo strip_tags($d);
Unfortunately, all the tag values end up run together, like user14036100 9.00user23034003 11.33user32028000 14.00, where user1, user2 and user3 are each followed by their attribute values. It is hard to analyse the values because strip_tags() joins them all together.
Can someone help me strip each tag and either store the values in an array or place a delimiter after each stripped value?
Thanks in advance :)
You cannot achieve this with strip_tags(), since it just removes the tags. You want to replace them with something, e.g. a whitespace character (new line, space, ...).
You should probably do this with a regex call, which just replaces all tags.
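A rough sketch of that regex approach, assuming a newline is an acceptable delimiter and that the markup contains no stray angle brackets inside attribute values or scripts. The function name splitOnTags() and the sample row are made up for illustration.

```php
<?php
// Hypothetical helper: replaces every tag with a newline, then splits on it,
// so each piece of text between tags becomes one array entry.
function splitOnTags(string $html): array
{
    $text  = preg_replace('/<[^>]*>/', "\n", $html);
    $parts = array_map('trim', explode("\n", $text));
    // Adjacent tags leave empty strings behind; drop them and reindex
    return array_values(array_filter($parts, 'strlen'));
}

$fields = splitOnTags('<tr><td>user1</td><td>4036100</td><td> 9.00</td></tr>');
print_r($fields);
```

This keeps the values separate, but it knows nothing about which value belongs to which column.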
A better way would be to parse the fetched page with DOMDocument, so that you can derive the structure directly from the HTML structure.
Example of usage of DOMDocument
You have the following example html page:
<!DOCTYPE html>
<html>
<head>
<title>This is my title</title>
</head>
<body>
<table id="someDataHere">
<tr>
<th>Country</th>
<th>Population</th>
</tr>
<tr>
<td>Germany</td>
<td>81,779,600</td>
</tr>
<tr>
<td>Belgium</td>
<td>11,007,020</td>
</tr>
<tr>
<td>Netherlands</td>
<td>16,847,007</td>
</tr>
</table>
</body>
</html>
You can use DOMDocument to fetch the entries in the table:
$url = "...";
$dom = new DOMDocument("1.0", "UTF-8");
$dom->loadHTML(file_get_contents($url));
$preparedData = array();
$table = $dom->getElementById("someDataHere");
$tableRows = $table->getElementsByTagName('tr');
foreach ($tableRows as $tableRow)
{
$columns = $tableRow->getElementsByTagName('td');
// skip the header row of the table - it has no <td>, just <th>
if (0 == $columns->length)
{
continue;
}
$preparedData[ $columns->item(0)->nodeValue ] = $columns->item(1)->nodeValue;
}
$preparedData will now hold the following data:
Array
(
[Germany] => 81,779,600
[Belgium] => 11,007,020
[Netherlands] => 16,847,007
)
Some notes
Since you are developing a crawler (spider), you are highly dependent on the HTML structure of the target webpage. You may have to adjust your crawler every time they change something in their templates.
This is just a simple example, but it should make clear how you can build on it to produce more advanced results.
Since DOMDocument implements the standard DOM methods, you have to work your way through the HTML structure with the means they provide.
For very large HTML pages, DOMDocument can become quite expensive in terms of memory.

I'm using Simple HTML to grab data out of a table and need help

Sorry for the poor title guys, but I'm whooped. I have a table as such:
<table class="gsborder" cellspacing="0" cellpadding="2" rules="cols" border="1" id="d00">
<tr class="gridItem">
<td>Code</td><td>0adf</td>
</tr><tr class="AltItem">
<td>CompanyName</td><td>Some Company</td>
</tr><tr class="Item">
<td>Owner</td><td>Jim Jim</td>
</tr><tr class="AltItem">
<td>DivisionName</td><td> </td>
</tr><tr class="Item">
<td>AddressLine1</td><td>9314 W. SPRING ST.</td>
</tr>
</table>
I'm using the following code to get my data out:
$foo = $html->getElementById("d00")->childNodes(1)->childNodes(1);
The problem with this though is that I am getting the two <td></td> tags with my data. Is there a way to only grab the raw data without the tags?
Also, is this the right way to get my data out of this table?
Try using:
$foo = $html->getElementById("d00")->childNodes(1)->childNodes(1)->plaintext;
or innertext.
// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);
echo $e->tag; // Returns: " div"
echo $e->outertext; // Returns: " <div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: " foo <b>bar</b>"
echo $e->plaintext; // Returns: " foo bar"
taken from: http://simplehtmldom.sourceforge.net/manual.htm
As a rule of thumb, whatever DOM API you are using, once you've located the element(s) you are interested in getting data from, accessing the text nodes they contain requires a bit more work.
Use strip_tags to get raw text.
http://us.php.net/manual/en/function.strip-tags.php
So:
$foo = strip_tags($html->getElementById("d00")->childNodes(1)->childNodes(1));
