I want to get the date object text content and Team 1. But Team 2 object has the same attribute option with date object. How can I get the right content? If I echo $date I get date value with Team2... How should I write conditions?
<table width="100%" cellpadding=2 cellspacing=0 id="tblFixture" border=0>
<tr class=row1 align=center side='home'>
<td align=left>21.09.1928</td>
<td> </td>
<td align='right'><span class='team'>Team 1</span></td>
<td align=left><a href='http://www.foo.com/bar' target='_blank'>Team 2</a></td>
</td>
</tr>
PHP Code:
$url = "http://www.bla.com/bla.html";
$dom = new DOMDocument;
#$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$nlig = $xpath->query('//table[#id="tblFixture"]/tr[#side=\'home\']');
$i = 0;
foreach ($nlig AS $val)
{
$date = $xpath->query('//table[#id="tblFixture"]/tr[#side=\'home\'][#class=\'row1\']/td[#align=\'left\']')->item($i)->textContent;
$first_team = $xpath->query('//table[#id="tblFixture"]/tr[#side=\'home\']/td[#align=\'right\']/span[#class=\'team\']')->item($i)->textContent;
echo $date, $first_team, "<br />";
$i++;
}
You can use a regular expression to validate / find the date.
Something like:
preg_match("/<td align=left>([0-9]{2}.[0-9]{2}.[0-9]{4})<\/td>/", $html, $matches);
Related
Each time it loops, the text that it shows only the Product_URL. I really confuse how to solve this problem. I guess there is something wrong with the loop.
<html>
<head>
<title>Display main Image</title>
</head>
<body>
<table>
<tr>
<th>Thumbnail Image</th>
<th>Product Name</th>
<th>Product Description</th>
<th>Price</th>
<th>Weight</th>
<th>Avail</th>
<th>Product URL</th>
</tr>
<tr>
<?php
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->Load('xml_feeds7.xml');
$xpath = new DOMXPath($doc);
$listquery = array('//item/thumbnail_url', '//item/productname', '//item/productdesciption', '//item/price', '//item/weight', '//item/avail', '//item/product_url');
foreach ($listquery as $queries) {
$entries = $xpath->query($queries);
foreach ($entries as $entry) { ?>
<tr>
<td>
<img src="<?php echo $entry->nodeValue; ?>" width="100px" height="100px">
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php
$price_value = $entry->nodeValue;
echo str_replace($price_value, ".00", "");
?>
</td>
<td>
<?php
$weight_value = $entry->nodeValue;
echo str_replace($weight_value, ".00", "");
?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
<td>
<?php echo "$entry->nodeValue"; ?>
</td>
</tr>
}
}
</tr>
</table>
</body>
</html>
The table should be displaying:
---------------------------------------------------------------------------------
| Thumbnail | Product Name | Description | Price | Weight | Avail | Product_URL |
---------------------------------------------------------------------------------
Xpath can return scalar values (strings and numbers) directly, but you have to do the typecast in the Expression and use DOMxpath::evaluate().
You should iterate the items and then use the item as a context for the detail data expressions. Building separate lists can result in invalid data (if an element in on of the items is missing).
Last you can use DOM methods to create the HTML table. That way it will take care of escaping and closing the tags.
$xml = <<<'XML'
<items>
<item>
<thumbnail_url>image.png</thumbnail_url>
<productname>A name</productname>
<productdescription>Some text</productdescription>
<price currency="USD">42.21</price>
<weight unit="g">23</weight>
<avail>10</avail>
<product_url>page.html</product_url>
</item>
</items>
XML;
$document = new DOMDocument;
$document->preserveWhiteSpace = false;
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$fields = [
'Thumbnail' => 'string(thumbnail_url)',
'Product Name' => 'string(productname)',
'Description' => 'string(productdescription)',
'Price' => 'number(price)',
'Weight' => 'number(weight)',
'Availability' => 'string(avail)',
'Product_URL' => 'string(product_url)'
];
$html = new DOMDocument();
$table = $html->appendChild($html->createElement('table'));
$row = $table->appendChild($html->createElement('tr'));
// add table header cells
foreach ($fields as $caption => $expression) {
$row
->appendChild($html->createElement('th'))
->appendChild($html->createTextNode($caption));
}
// iterate the items in the XML
foreach ($xpath->evaluate('//item') as $item) {
// add a new table row
$row = $table->appendChild($html->createElement('tr'));
// iterate the field definitions
foreach ($fields as $caption => $expression) {
// fetch the value using the expression in the item context
$value = $xpath->evaluate($expression, $item);
switch ($caption) {
case 'Thumbnail':
// special handling for the thumbnail field
$image = $row
->appendChild($html->createElement('td'))
->appendChild($html->createElement('img'));
$image->setAttribute('src', $value);
break;
case 'Price':
case 'Weight':
// number format for price and weight values
$row
->appendChild($html->createElement('td'))
->appendChild(
$html->createTextNode(
number_format($value, 2, '.')
)
);
break;
default:
$row
->appendChild($html->createElement('td'))
->appendChild($html->createTextNode($value));
}
}
}
$html->formatOutput = TRUE;
echo $html->saveHtml();
Output:
<table>
<tr>
<th>Thumbnail</th>
<th>Product Name</th>
<th>Description</th>
<th>Price</th>
<th>Weight</th>
<th>Availability</th>
<th>Product_URL</th>
</tr>
<tr>
<td><img src="image.png"></td>
<td>A name</td>
<td>Some text</td>
<td>42.21</td>
<td>23.00</td>
<td>10</td>
<td>page.html</td>
</tr>
</table>
I've changed it to use SimpleXML as this is a fairly simple data structure - but this fetches each <item> and then displays the values from there. I've only done this with a few values, but hopefully this shows the idea...
$doc = simplexml_load_file('xml_feeds7.xml');
foreach ( $doc->xpath("//item") as $item ) {
echo "<tr>";
echo "<td><img src=\"{$item->thumbnail_url}\" width=\"100px\" height=\"100px\"></td>";
echo "<td>{$item->productname}</td>";
echo "<td>{$item->productdesciption}</td>";
// Other fields...
$price_value = str_replace(".00", "",(string)$item->price);
echo "<td>{$price_value}</td>";
// Other fields...
echo "</tr>";
}
Rather than use XPath for each value, it uses $item->elementName, so $item->productname is the productname. A much simpler way of referring to each field.
Note that with the price field, as you are processing it further - you have to cast it to a string to ensure it will process correctly.
Update:
If you need to access data in a namespace in SimpleXML, you can use XPath, or in this case there is a simple (bit roundabout way). Using the ->children() method you can pass the namespace of the elements you want, this will then give you a new SimpleXMLElement with all the elements for that namespace.
$extraData = $item->children('g',true);
echo "<td>{$extraData->productname}</td>";
Now - $extraData will have any element with g as the namespace prefix, and they can be referred to in the same way as before, but instead of $item you use $extraData.
I have this DOM in my site:
$question_data = '<p>Gambar di bawah menunjukkan ciri-ciri haiwan berikut:</p>
<p><img src="/uploads/images/questions/96_20160303124007.PNG" /></p>
<table border="1" cellpadding="1" cellspacing="1" style="width: 500px;">
<tbody>
<tr>
<td style="text-align: center;">Beranak</td>
<td>
<p style="text-align: center;">Bertelur</p>
</td>
</tr>
</tbody>
</table>
<p>##a</p>';
This my REGEX to filter 96_20160303124007.PNG:
define('GET_IMAGE_NAME_WITH_EXTENSION_PATTERN','/<img .*?src=(?:['\"])[^\"]*\/\K(.*?\.(?:jpeg|jpg|bmp|gif|png))(?:['\"]).*?>/');
$pattern = GET_IMAGE_NAME_WITH_EXTENSION_PATTERN;
$arr_image_file_names = array();
preg_match_all($pattern, $question_data, $arr_image_file_names);
But got no output... anyone knows how to solve this?
It's common knowledge that regex is not the right tool for HTML parsing.
A solution without using regular expression:
$xml = new DOMDocument();
$xml->loadHTML($question_data);
$imgNodes = $xml->getElementsByTagName('img');
$arr_image_file_names = [];
for ($i = $imgNodes->length - 1; $i >= 0; $i--) {
$imgNode = $imgNodes->item($i);
$arr_image_file_names[] = pathinfo($imgNode->getAttribute('src), PATHINFO_BASENAME);
}
change your constant to:
define('GET_IMAGE_NAME_WITH_EXTENSION_PATTERN', '/\d+_\d+\.(?:jpeg|jpg|bmp|gif|png)/i');
I am really struggling attempting to scrape a table either via XPath or any sort of 'getElement' method. I have searched around and attempted various different approaches to solve my problem below but have come up short and really appreciate any help.
First, the HTML portion I am trying to scrape is the 2nd table on the document and looks like:
<table class="table2" border="1" cellspacing="0" cellpadding="3">
<tbody>
<tr><th colspan="8" align="left">Status Information</th></tr>
<tr><th align="left">Status</th><th align="left">Type</th><th align="left">Address</th><th align="left">LP</th><th align="left">Agent Info</th><th align="left">Agent Email</th><th align="left">Phone</th><th align="center">Email Tmplt</th></tr>
<tr></tr>
<tr>
<td align="left">Active</td>
<td align="left">Resale</td>
<td align="center">*Property Address*</td>
<td align="right">*Price*</td>
<td align="center">*Agent Info*</td>
<td align="center">*Agent Email*</td>
<td align="center">*Agent Phone*</td>
<td align="center"> </td>
</tr>
<tr>
<td align="left">Active</td>
<td align="left">Resale</td>
<td align="center">*Property Address*</td>
<td align="right">*Price*</td>
<td align="center">*Agent Info*</td>
<td align="center">*Agent Email*</td>
<td align="center">*Agent Phone*</td>
<td align="center"> </td>
</tr>
...etc
With additional trs continuing containing 8 tds with the same information as detailed above.
What I need to do is iterate through the trs and internal tds to pick up each piece of information (inside the td) for each entry (inside of the tr).
Here is the code I have been struggling with:
<?php
$payload = array(
'http'=>array(
'method'=>"POST",
'content'=>'key=value'
)
);
stream_context_set_default($payload);
$dom = new DOMDocument();
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('website-scraping-from.com');
libxml_clear_errors();
foreach ($dom->getElementsByTagName('tr') as $row){
foreach($dom->$row->getElementsByTagName('td') as $node){
echo $node->textContent . "<br/>";
}
}
?>
This code is not returning nearly what I need and I am having a lot of trouble trying to figure out how to fix it, perhaps XPath is a better route to go to find the table / information I need, but I have come up empty with that method as well. Any information is much appreciated.
If it matters, my end goal is to be able to take the table data and dump it into a database if the first td has a value of "Active".
Can this be of any help?
$table = $dom->getElementsByTagName('table')->item(1);
foreach ($table->getElementsByTagName('tr') as $row){
$cells = $row->getElementsByTagName('td');
if ( $cells->item(0)->nodeValue == 'Active' ) {
foreach($cells as $node){
echo $node->nodeValue . "<br/>";
}
}
}
This will fetch the second table, and display the contents of the rows starting with a first cell "Active".
Edit: Here is a more extensive help:
$arr = array();
$table = $dom->getElementsByTagName('table')->item(1);
foreach ($table->getElementsByTagName('tr') as $row){
$cells = $row->getElementsByTagName('td');
if ( $cells->item(0)->nodeValue == 'Active' ) {
$obj = new stdClass;
$obj->type = $cells->item(1)->nodeValue;
$obj->address = $cells->item(2)->nodeValue;
$obj->price = $cells->item(3)->nodeValue;
$obj->agent = $cells->item(4)->nodeValue;
$obj->email = $cells->item(5)->nodeValue;
$obj->phone = $cells->item(6)->nodeValue;
array_push( $arr, $obj );
}
}
print_r( $arr );
I am working on a PHP Simple DOM Parser and i want a simple solution for my question
<tr>
<td class="one">1</td>
<td class="two">2</td>
<td class="three">3</td>
</tr>
<tr>
<td class="one">10</td>
<td class="two">20</td>
<td class="three">30</td>
</tr>...
the html of mine is will look similar to the above
and i am looping over through td something like this
foreach ($sample->find("td") as $ele)
{
if($ele->class == "one")
echo "ONE = ".$ele->plaintext;
if($ele->class == "two")
echo "TWO= ".$ele->plaintext;
}
But is there any simple solution that without if condition getting the plaintext of particular class i dont want shorthand if also
I am expecting something like this below
$ele->class->one
take a look at it:
<?php
$html = "
<table>
<tr>
<td class='one'>1</td>
<td class='two'>2</td>
<td class='three'>3</td>
</tr>
<tr>
<td class='one'>10</td>
<td class='two'>20</td>
<td class='three'>30</td>
</tr>
</table>
";
// Your class name
$classeName = 'one';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Get the results
$results = $xpath->query("//*[#class='" . $classeName . "']");
for($i=0; $i < $results->length; $i++) {
echo $review = $results->item($i)->nodeValue . "<br>";
}
?>
I'm trying to extract img src and the text of the TDs inside the div id="Ajax" but i'm unable to extract the img with my code. It just ignores the img src. How can i extract also the img src and add it in the array?
HTML:
<div id="Ajax">
<table cellpadding="1" cellspacing="0">
<tbody>
<tr id="comment_1">
<td>20:28</td>
<td class="color">
</td>
<td class="last_comment">
Text<br/>
</td>
</tr>
<tr id="comment_2">
<td>20:25</td>
<td class="color">
</td>
<td class="comment">
Text 2<br/>
</td>
</tr>
<tr id="comment_3">
<td>20:24</td>
<td class="color">
<img src="http://url.ext/img/image02.jpeg" alt="img alt 2"/>
</td>
<td class="comment">
Text 3<br/>
</td>
</tr>
<tr id="comment_4">
<td>20:23</td>
<td class="color">
<img src="http://url.ext/img/image01.jpeg" alt="img alt"/>
</td>
<td class="comment">
Text 4<br/>
</td>
</tr>
</div>
PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
#$doc->loadHTML($html);
$contentArray = array();
$doc = $doc->getElementById('Ajax');
$text = $doc->getElementsByTagName ('td');
foreach ($text as $t)
{
$contentArray[] = $t->nodeValue;
}
print_r ($contentArray);
Thanks.
You're using $t->nodeValue to obtain the content of a node. An <img> tag is empty, thus has nothing to return. The easiest way to get the src attribute would be XPath.
Example:
$html = file_get_contents($url);
$doc = new DOMDocument();
#$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$expression = "//div[#id='Ajax']//tr";
$nodes = $xpath->query($expression); // Get all rows (tr) in the div
$imgSrcExpression = ".//img/#src";
$firstTdExpression = "./td[1]";
foreach($nodes as $node){ // loop over each row
// select the first td node
$tdNodes = $xpath->query($firstTdExpression ,$node);
$tdVal = null;
if($tdNodes->length > 0){
$tdVal = $tdNodes->item(0)->nodeValue;
}
// select the src attribute of the img node
$imgNodes = $xpath->query($imgSrcExpression,$node);
$imgVal = null;
if($imgNodes ->length > 0){
$imgVal = $imgNodes->item(0)->nodeValue;
}
}
(Caution: Code may contain typos)