Scrape DOMDocument Table for Contents in PHP

I am really struggling to scrape a table, whether via XPath or any of the 'getElement' methods. I have searched around and tried several approaches to the problem below but have come up short, so I would really appreciate any help.
First, the HTML I am trying to scrape is the second table in the document and looks like this:
<table class="table2" border="1" cellspacing="0" cellpadding="3">
<tbody>
<tr><th colspan="8" align="left">Status Information</th></tr>
<tr><th align="left">Status</th><th align="left">Type</th><th align="left">Address</th><th align="left">LP</th><th align="left">Agent Info</th><th align="left">Agent Email</th><th align="left">Phone</th><th align="center">Email Tmplt</th></tr>
<tr></tr>
<tr>
<td align="left">Active</td>
<td align="left">Resale</td>
<td align="center">*Property Address*</td>
<td align="right">*Price*</td>
<td align="center">*Agent Info*</td>
<td align="center">*Agent Email*</td>
<td align="center">*Agent Phone*</td>
<td align="center"> </td>
</tr>
<tr>
<td align="left">Active</td>
<td align="left">Resale</td>
<td align="center">*Property Address*</td>
<td align="right">*Price*</td>
<td align="center">*Agent Info*</td>
<td align="center">*Agent Email*</td>
<td align="center">*Agent Phone*</td>
<td align="center"> </td>
</tr>
...etc
Additional trs follow, each containing 8 tds with the same kind of information as shown above.
What I need to do is iterate through the trs and internal tds to pick up each piece of information (inside the td) for each entry (inside of the tr).
Here is the code I have been struggling with:
<?php
$payload = array(
'http'=>array(
'method'=>"POST",
'content'=>'key=value'
)
);
stream_context_set_default($payload);
$dom = new DOMDocument();
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('website-scraping-from.com');
libxml_clear_errors();
foreach ($dom->getElementsByTagName('tr') as $row){
foreach($dom->$row->getElementsByTagName('td') as $node){
echo $node->textContent . "<br/>";
}
}
?>
This code returns nowhere near what I need, and I am having a lot of trouble figuring out how to fix it. Perhaps XPath is a better way to find the table and the information I need, but I have come up empty with that method as well. Any information is much appreciated.
If it matters, my end goal is to be able to take the table data and dump it into a database if the first td has a value of "Active".

Can this be of any help?
$table = $dom->getElementsByTagName('table')->item(1);
foreach ($table->getElementsByTagName('tr') as $row){
$cells = $row->getElementsByTagName('td');
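// Header rows and the empty <tr></tr> contain no td, so check the length first
if ( $cells->length > 0 && $cells->item(0)->nodeValue == 'Active' ) {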
foreach($cells as $node){
echo $node->nodeValue . "<br/>";
}
}
}
This fetches the second table and displays the contents of every row whose first cell is "Active".
Edit: Here is a more extensive example:
$arr = array();
$table = $dom->getElementsByTagName('table')->item(1);
foreach ($table->getElementsByTagName('tr') as $row){
$cells = $row->getElementsByTagName('td');
// Skip rows without td cells (header rows, empty rows)
if ( $cells->length > 0 && $cells->item(0)->nodeValue == 'Active' ) {
$obj = new stdClass;
$obj->type = $cells->item(1)->nodeValue;
$obj->address = $cells->item(2)->nodeValue;
$obj->price = $cells->item(3)->nodeValue;
$obj->agent = $cells->item(4)->nodeValue;
$obj->email = $cells->item(5)->nodeValue;
$obj->phone = $cells->item(6)->nodeValue;
array_push( $arr, $obj );
}
}
print_r( $arr );
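Since the question also asks about XPath, here is a minimal sketch of the same filtering done with DOMXPath. The sample markup (cut down to three columns), the example addresses, and the $listings name are assumptions made for illustration; the real page has eight columns per row.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$html = <<<'HTML'
<table><tr><td>first table</td></tr></table>
<table class="table2">
<tr><th colspan="3">Status Information</th></tr>
<tr><th>Status</th><th>Type</th><th>Address</th></tr>
<tr></tr>
<tr><td>Active</td><td>Resale</td><td>1 Example St</td></tr>
<tr><td>Sold</td><td>Resale</td><td>2 Example St</td></tr>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Rows of the second table (in document order) whose first td reads "Active".
// Rows without any td compare as an empty string and are skipped.
$rows = $xpath->query('(//table)[2]//tr[normalize-space(td[1]) = "Active"]');

$listings = [];
foreach ($rows as $row) {
    $cells = [];
    // 'td' is evaluated relative to the current row
    foreach ($xpath->query('td', $row) as $cell) {
        $cells[] = trim($cell->textContent);
    }
    $listings[] = $cells;
}
print_r($listings);
```

Each entry of $listings is then one row of cell values, which could be passed to a prepared INSERT statement.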

Related

I have to display image and data from xml, how can I do it in php?

Each time it loops, the only text it shows is the Product_URL. I am really confused about how to solve this problem. I guess there is something wrong with the loop.
<html>
<head>
<title>Display main Image</title>
</head>
<body>
<table>
<tr>
<th>Thumbnail Image</th>
<th>Product Name</th>
<th>Product Description</th>
<th>Price</th>
<th>Weight</th>
<th>Avail</th>
<th>Product URL</th>
</tr>
<tr>
<?php
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->Load('xml_feeds7.xml');
$xpath = new DOMXPath($doc);
$listquery = array('//item/thumbnail_url', '//item/productname', '//item/productdesciption', '//item/price', '//item/weight', '//item/avail', '//item/product_url');
foreach ($listquery as $queries) {
$entries = $xpath->query($queries);
foreach ($entries as $entry) { ?>
<tr>
<td>
<img src="<?php echo $entry->nodeValue; ?>" width="100px" height="100px">
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php
$price_value = $entry->nodeValue;
echo str_replace($price_value, ".00", "");
?>
</td>
<td>
<?php
$weight_value = $entry->nodeValue;
echo str_replace($weight_value, ".00", "");
?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
</tr>
}
}
</tr>
</table>
</body>
</html>
The table should be displaying:
---------------------------------------------------------------------------------
| Thumbnail | Product Name | Description | Price | Weight | Avail | Product_URL |
---------------------------------------------------------------------------------
XPath can return scalar values (strings and numbers) directly, but you have to do the typecast in the expression and use DOMXPath::evaluate().
You should iterate over the items and then use each item as the context for the detail expressions. Building separate lists can result in mismatched data (if an element is missing in one of the items).
Finally, you can use DOM methods to create the HTML table. That way the DOM takes care of escaping and closing the tags.
$xml = <<<'XML'
<items>
<item>
<thumbnail_url>image.png</thumbnail_url>
<productname>A name</productname>
<productdescription>Some text</productdescription>
<price currency="USD">42.21</price>
<weight unit="g">23</weight>
<avail>10</avail>
<product_url>page.html</product_url>
</item>
</items>
XML;
$document = new DOMDocument;
$document->preserveWhiteSpace = false;
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$fields = [
'Thumbnail' => 'string(thumbnail_url)',
'Product Name' => 'string(productname)',
'Description' => 'string(productdescription)',
'Price' => 'number(price)',
'Weight' => 'number(weight)',
'Availability' => 'string(avail)',
'Product_URL' => 'string(product_url)'
];
$html = new DOMDocument();
$table = $html->appendChild($html->createElement('table'));
$row = $table->appendChild($html->createElement('tr'));
// add table header cells
foreach ($fields as $caption => $expression) {
$row
->appendChild($html->createElement('th'))
->appendChild($html->createTextNode($caption));
}
// iterate the items in the XML
foreach ($xpath->evaluate('//item') as $item) {
// add a new table row
$row = $table->appendChild($html->createElement('tr'));
// iterate the field definitions
foreach ($fields as $caption => $expression) {
// fetch the value using the expression in the item context
$value = $xpath->evaluate($expression, $item);
switch ($caption) {
case 'Thumbnail':
// special handling for the thumbnail field
$image = $row
->appendChild($html->createElement('td'))
->appendChild($html->createElement('img'));
$image->setAttribute('src', $value);
break;
case 'Price':
case 'Weight':
// number format for price and weight values
$row
->appendChild($html->createElement('td'))
->appendChild(
$html->createTextNode(
// four arguments: PHP versions before 8 reject a three-argument call
number_format($value, 2, '.', '')
)
);
break;
default:
$row
->appendChild($html->createElement('td'))
->appendChild($html->createTextNode($value));
}
}
}
$html->formatOutput = TRUE;
echo $html->saveHtml();
Output:
<table>
<tr>
<th>Thumbnail</th>
<th>Product Name</th>
<th>Description</th>
<th>Price</th>
<th>Weight</th>
<th>Availability</th>
<th>Product_URL</th>
</tr>
<tr>
<td><img src="image.png"></td>
<td>A name</td>
<td>Some text</td>
<td>42.21</td>
<td>23.00</td>
<td>10</td>
<td>page.html</td>
</tr>
</table>
I've changed it to use SimpleXML as this is a fairly simple data structure - but this fetches each <item> and then displays the values from there. I've only done this with a few values, but hopefully this shows the idea...
$doc = simplexml_load_file('xml_feeds7.xml');
foreach ( $doc->xpath("//item") as $item ) {
echo "<tr>";
echo "<td><img src=\"{$item->thumbnail_url}\" width=\"100px\" height=\"100px\"></td>";
echo "<td>{$item->productname}</td>";
echo "<td>{$item->productdesciption}</td>";
// Other fields...
$price_value = str_replace(".00", "",(string)$item->price);
echo "<td>{$price_value}</td>";
// Other fields...
echo "</tr>";
}
Rather than use XPath for each value, it uses $item->elementName, so $item->productname is the productname. A much simpler way of referring to each field.
Note that with the price field, as you are processing it further - you have to cast it to a string to ensure it will process correctly.
Update:
If you need to access data in a namespace in SimpleXML, you can use XPath, or in this case there is a simple (if slightly roundabout) way. Using the ->children() method you can pass the namespace of the elements you want; this then gives you a new SimpleXMLElement with all the elements in that namespace.
$extraData = $item->children('g',true);
echo "<td>{$extraData->productname}</td>";
Now - $extraData will have any element with g as the namespace prefix, and they can be referred to in the same way as before, but instead of $item you use $extraData.
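To make the namespace handling concrete, here is a small self-contained sketch. The feed content and the namespace URI bound to the g prefix are made up for illustration; only the ->children('g', true) call comes from the answer above.

```php
<?php
// Hypothetical feed: the namespace URI and the values are assumptions.
$xml = <<<'XML'
<items xmlns:g="http://example.com/ns/g">
  <item>
    <productname>Plain name</productname>
    <g:productname>Namespaced name</g:productname>
    <g:price>42.00</g:price>
  </item>
</items>
XML;

$doc = simplexml_load_string($xml);
$names = [];
foreach ($doc->item as $item) {
    // Second argument true: 'g' is a prefix, not a namespace URI
    $extraData = $item->children('g', true);
    $names[] = [
        (string) $item->productname,      // element without a namespace
        (string) $extraData->productname, // element with the g: prefix
    ];
}
print_r($names);
```

Casting to string matters here for the same reason as with the price field: without the cast the array would hold SimpleXMLElement objects rather than text.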

PHP DOM Parser Get Specific text by Class While Looping

I am working with the PHP Simple HTML DOM Parser and I want a simple solution to my question.
<tr>
<td class="one">1</td>
<td class="two">2</td>
<td class="three">3</td>
</tr>
<tr>
<td class="one">10</td>
<td class="two">20</td>
<td class="three">30</td>
</tr>...
My HTML looks similar to the above,
and I am looping over the tds like this:
foreach ($sample->find("td") as $ele)
{
if($ele->class == "one")
echo "ONE = ".$ele->plaintext;
if($ele->class == "two")
echo "TWO= ".$ele->plaintext;
}
But is there a simple way to get the plaintext of a particular class without the if conditions? I don't want a shorthand if either.
I am expecting something like this:
$ele->class->one
Take a look at this:
<?php
$html = "
<table>
<tr>
<td class='one'>1</td>
<td class='two'>2</td>
<td class='three'>3</td>
</tr>
<tr>
<td class='one'>10</td>
<td class='two'>20</td>
<td class='three'>30</td>
</tr>
</table>
";
// Your class name
$classeName = 'one';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Get the results
$results = $xpath->query("//*[@class='" . $classeName . "']");
for($i=0; $i < $results->length; $i++) {
echo $review = $results->item($i)->nodeValue . "<br>";
}
?>
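If the goal is to address cells by class name per row, something close to the $ele->class->one idea, one option is to build an array keyed by class for each tr. This is only a sketch using DOMDocument rather than Simple HTML DOM, and the $rows structure is an assumption chosen for illustration.

```php
<?php
$html = "<table>
<tr><td class='one'>1</td><td class='two'>2</td><td class='three'>3</td></tr>
<tr><td class='one'>10</td><td class='two'>20</td><td class='three'>30</td></tr>
</table>";

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->query('//tr') as $tr) {
    $row = [];
    // Only the td elements of this row that carry a class attribute
    foreach ($xpath->query('td[@class]', $tr) as $td) {
        $row[$td->getAttribute('class')] = trim($td->textContent);
    }
    $rows[] = $row;
}

// Cells are now addressable by row index and class name, with no if chain
echo $rows[1]['two'];
```

A cell with several classes would be keyed by the full attribute value, so this fits best when each td has exactly one class, as in the question.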

Getting PHP str_replace to work with Joomla

As you may know, Joomla components enable you to override their output by copying their template files into your site template. Joomla components generally use helper files which cannot be overridden.
I have a helper.php file that includes the string:
$specific_fields_text = '<tr><td class="key">'.$specific_field_title.': </td><td class="kr_sidecol_subaddress">'.$specific_fields[$i]->text.' '.$specific_fields[$i]->description.'</td></tr>';
In my template override is the code:
<table border="0" cellpadding="2" cellspacing="0">
<?php echo koparentHTML::getHTMLSpecificFields($this->specific_fields); ?>
</table>
The output is as follows:
<table border="0" cellpadding="2" cellspacing="0">
<tr>
<td class="key">title</td>
<td class="kr_sidecol_subaddress">value</td>
</tr>
<tr>
<td class="key">title</td>
<td class="kr_sidecol_subaddress">value</td>
</tr>
//.....etc......//
</table>
Basically I want to get rid of the table and turn it into a definition list but I cannot modify the helper.php file. I am thinking that the answer is to do with str_replace
I have tried using:
<dl>
<?php
$spec_fields = koparentHTML::getHTMLSpecificFields($this->specific_fields);
$spec_fields_dl = str_replace("<tr><td class='key'>'.$specific_field_title.': </td><td class='kr_sidecol_subaddress'>'.$specific_fields[$i]->text.' '.$specific_fields[$i]->description.'</td></tr>'", "<dt class='key'>'.$specific_field_title.': </dt><dd class='kr_sidecol_subaddress'>'.$specific_fields[$i]->text.' '.$specific_fields[$i]->description.'</dd>'", $spec_fields);
echo $spec_fields_dl;
?>
</dl>
This returns all of the text but with no html tags (no tr, td, dt, etc).
You can easily parse table data with PHP, like in this example:
$doc = new DOMDocument();
$doc->loadHTML(koparentHTML::getHTMLSpecificFields($this->specific_fields));
$rows = $doc->getElementsByTagName('tr');
$data = array();
for ($i = 0; $i < $rows->length; $i++) {
$cols = $rows->item($i)->getElementsByTagName("td");
// key = text of the first cell, value = text of the second cell
$data[$cols->item(0)->nodeValue] = $cols->item(1)->nodeValue;
}
var_dump($data);
This should convert your table into assoc array ('title' => 'value').
I hope it helps.
I have figured this out. For some reason the PHP bits such as '.$specific_field_title.' were stopping the str_replace from working. To get around this I just searched for the HTML elements and put them in an array like so:
echo str_replace(array('<tr><td class="key">', '</td><td class="kr_sidecol_subaddress">', '</td></tr>'),
array('<dt class="key">', '</dt><dd class="kr_sidecol_subaddress">', '</dd>'),
koparentHTML::getHTMLSpecificFields($this->specific_fields));
And now this works perfectly. Thank you to everyone who contributed.
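For comparison, the DOM parsing idea and the definition list output can be combined. The sketch below uses a hard-coded sample string in place of the helper's return value (the field titles and values are invented), and it keeps only the text of each cell, so any markup inside a cell would be lost.

```php
<?php
// Sample rows in the shape shown in the question; the content is made up.
$rowsHtml = '<tr><td class="key">Colour: </td><td class="kr_sidecol_subaddress">Red</td></tr>'
          . '<tr><td class="key">Size: </td><td class="kr_sidecol_subaddress">Large</td></tr>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
// Wrap the bare rows in a table so the parser sees complete markup
$doc->loadHTML('<table>' . $rowsHtml . '</table>');
libxml_clear_errors();

$out = '<dl>';
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $cells = $tr->getElementsByTagName('td');
    if ($cells->length < 2) {
        continue; // skip rows that do not hold a title/value pair
    }
    $out .= '<dt class="key">' . htmlspecialchars(trim($cells->item(0)->textContent)) . '</dt>';
    $out .= '<dd class="kr_sidecol_subaddress">' . htmlspecialchars(trim($cells->item(1)->textContent)) . '</dd>';
}
$out .= '</dl>';
echo $out;
```

The str_replace version is shorter and keeps inner markup, but it breaks as soon as the helper changes its whitespace or attribute quoting; the DOM version tolerates that at the cost of more code.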

PHP textContent removing HTML?

I have the following script which loops through a HTML table and gets the values from it then returns the value of the table in a td.
$tds = $dom->getElementsByTagName('td');
// New dom
$dom2 = new DOMDocument;
$x = 1;
// Loop through all the tds printing the value with a new class
foreach($tds as $t) {
if($x%2 == 1)
print "</tr><tr>";
$class = ($x%2 == 1) ? "odd" : "even";
var_dump($t->textContent);
print "<td class='$class'>".$t->textContent."</td>";
$x++;
}
But the textContent seems to be stripping the HTML tags (for example it is a <p></p> wrapper tag). How can I get it to just give me the value?
Or is there another way of doing this? I have the following html
<table>
<tr>
<td>q1</td>
<td>a1</td>
</tr>
<tr>
<td>q2</td>
<td>a2</td>
</tr>
</table>
and I need to make it look like
<table>
<tr>
<td class="odd">q1</td>
<td class="even">a1</td>
</tr>
<tr>
<td class="odd">q2</td>
<td class="even">a2</td>
</tr>
</table>
It will always look the exact same way (minus extra element rows and the values which change).
Any help?
According to MDN this is the expected behaviour of textContent.
You can just add the class to the tds in the DOMDocument itself:
$tds = $dom->getElementsByTagName('td');
$x = 1;
foreach($tds as $td) {
if($x%2 == 1){
$td->setAttribute('class', 'odd');
}
else{
$td->setAttribute('class', 'even');
}
$x++;
}
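If the inner markup of each cell is still needed as a string, a common approach is to serialize the child nodes with saveHTML() instead of reading textContent. The sketch below assumes cells that wrap their text in a p tag, as described in the question; the $cellsOut name is made up for illustration.

```php
<?php
$html = '<table><tr><td><p>q1</p></td><td><p>a1</p></td></tr></table>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$x = 1;
$cellsOut = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    $td->setAttribute('class', $x % 2 == 1 ? 'odd' : 'even');

    // Serialize each child node to keep the tags that textContent drops
    $inner = '';
    foreach ($td->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    $cellsOut[] = trim($inner);
    $x++;
}

// Print the modified table without the html/body wrapper loadHTML adds
echo $dom->saveHTML($dom->getElementsByTagName('table')->item(0));
```

Passing a node to saveHTML() serializes just that node, which is also how the modified table is printed at the end.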

PHP DOM accessing the object with same attribute

I want to get the text content of the date cell and of Team 1. But the Team 2 cell has the same attribute (align=left) as the date cell. How can I get the right content? If I echo $date I get the date value together with Team 2. How should I write the conditions?
<table width="100%" cellpadding=2 cellspacing=0 id="tblFixture" border=0>
<tr class=row1 align=center side='home'>
<td align=left>21.09.1928</td>
<td> </td>
<td align='right'><span class='team'>Team 1</span></td>
<td align=left><a href='http://www.foo.com/bar' target='_blank'>Team 2</a></td>
</td>
</tr>
PHP Code:
$url = "http://www.bla.com/bla.html";
$dom = new DOMDocument;
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$nlig = $xpath->query('//table[@id="tblFixture"]/tr[@side=\'home\']');
$i = 0;
foreach ($nlig AS $val)
{
$date = $xpath->query('//table[@id="tblFixture"]/tr[@side=\'home\'][@class=\'row1\']/td[@align=\'left\']')->item($i)->textContent;
$first_team = $xpath->query('//table[@id="tblFixture"]/tr[@side=\'home\']/td[@align=\'right\']/span[@class=\'team\']')->item($i)->textContent;
echo $date, $first_team, "<br />";
$i++;
}
You can use a regular expression to validate / find the date.
Something like:
preg_match("/<td align=left>([0-9]{2}\.[0-9]{2}\.[0-9]{4})<\/td>/", $html, $matches);
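An alternative that stays within XPath is to evaluate the cell expressions relative to each row, so the date is simply the first td of that row and cannot be confused with the Team 2 cell. This is a sketch against a tidied copy of the markup from the question; the $fixtures array is a name chosen for illustration.

```php
<?php
// Tidied copy of the row from the question (attributes quoted).
$html = <<<'HTML'
<table id="tblFixture">
<tr class="row1" side="home">
<td align="left">21.09.1928</td>
<td> </td>
<td align="right"><span class="team">Team 1</span></td>
<td align="left"><a href="http://www.foo.com/bar">Team 2</a></td>
</tr>
</table>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$fixtures = [];
foreach ($xpath->query('//table[@id="tblFixture"]//tr[@side="home"]') as $row) {
    $fixtures[] = [
        // td[1] is the first cell of this row only, because $row is the context
        'date'  => $xpath->evaluate('string(td[1])', $row),
        'team1' => $xpath->evaluate('string(td[@align="right"]/span[@class="team"])', $row),
    ];
}
print_r($fixtures);
```

Using the row as the context node also removes the need for the $i counter, since each expression only ever sees the cells of the current row.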
