php: parsing table structure with SimpleXML

I'm trying to read in an xml file that for some reason has been modeled in a table structure like so:
<tr id="1">
<td name="Date">10/01/2009</td>
<td name="PromoName">Sample Promo Name</td>
<td name="PromoCode">Sample Promo Code</td>
<td name="PromoLevel" />
</tr>
This is just one sample row; the file has multiple <tr> blocks, all wrapped in a <table> element.
How can I read the values, given that every cell is a <td> distinguished only by its name attribute?

You could use simpleXML with an XPath expression.
$xml = simplexml_load_file('myFile.xml');
$values = $xml->xpath('//td[@name]');
foreach($values as $v) {
echo "Found $v<br />";
}
This would give you all the TD node values that have a name attribute, e.g.
Found 10/01/2009
Found Sample Promo Name
Found Sample Promo Code
Found <nothing cuz PromoLevel is empty>
Edit: To go through all the table rows, you could do something like this:
$rows = $xml->xpath('//tr');
foreach($rows as $row) {
echo $row['id'];
foreach($row->td as $td) {
if($td['name']) {
echo $td['name'],':',$td,'<br/>',PHP_EOL;
}
}
}
You might also want to have a look at this article.
Edit Fixed the XPath expression, as Josh suggested.
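If the values are needed as data rather than echoed, the row loop above can collect them into an array keyed by each cell's name attribute. This is a minimal sketch, not part of the original answer: the helper name rowsToArray and the second sample row are assumptions made for illustration.

```php
<?php
// Sample data in the shape shown in the question; the second row is made up.
$xmlString = <<<'XML'
<table>
  <tr id="1">
    <td name="Date">10/01/2009</td>
    <td name="PromoName">Sample Promo Name</td>
    <td name="PromoCode">Sample Promo Code</td>
    <td name="PromoLevel" />
  </tr>
  <tr id="2">
    <td name="Date">10/02/2009</td>
    <td name="PromoName">Another Promo</td>
  </tr>
</table>
XML;

// Hypothetical helper: returns [rowId => [name => value, ...], ...].
function rowsToArray(string $xmlString): array
{
    $xml = simplexml_load_string($xmlString);
    $result = [];
    foreach ($xml->xpath('//tr') as $row) {
        $fields = [];
        foreach ($row->td as $td) {
            if (isset($td['name'])) {
                // Cast to string: SimpleXML returns objects, not plain values.
                $fields[(string) $td['name']] = (string) $td;
            }
        }
        $result[(string) $row['id']] = $fields;
    }
    return $result;
}

$rows = rowsToArray($xmlString);
echo $rows['1']['PromoName'], PHP_EOL;
```

An empty cell such as PromoLevel ends up as an empty string, so callers can tell "present but empty" apart from "missing" with array_key_exists().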

Related

Getting DOM elements of html from file_get_contents [duplicate]

I am fetching html from a website with file_get_contents. I have a table (with a class name) inside html, and I want to get the data inside html tags.
This is how I fetch the html data from url:
$url = 'http://example.com';
$content = file_get_contents($url);
The html looks like:
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
</tbody>
</table>
Is there a way to search DOM elements in PHP like we do in jQuery, so that I can access the values 1 and 2 (the first td of each row) and the div's value inside the second td?
Something like
a) search the html for table with class name space
b) inside that table, inside tbody, return each tr's 'first td's value' and 'div's value inside second td'
So I get; 1 and Mars, 2 and Earth.
Use the DOM extension, for example. Its DOMXPath class is particularly useful for this kind of task.
You can easily set the listed conditions with an XPath expression like this:
//table[@class="space"]//tr[count(td) = 2]/td
where
- //table[@class="space"] selects all table elements in the document whose class attribute equals the string "space";
- //tr[count(td) = 2] selects all tr elements having exactly two td child elements;
- /td represents the td elements.
Sample implementation:
$html = <<<'HTML'
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
<tr>
<td class="earthia">3</td>
</tr>
</tbody>
</table>
HTML;
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//table[@class="space"]//tr[count(td) = 2]/td');
$i = 0;
foreach ($cells as $td) {
if (++$i % 2) {
$number = $td->nodeValue;
} else {
$planet = trim($td->textContent);
printf("%d: %s\n", $number, $planet);
}
}
Output
1: Mars
2: Earth
The code above should be treated as a sample rather than a recipe for practical use, as it does not scale well. The logic depends on the XPath expression selecting exactly two cells for each row. In practice, you may want to select the rows, iterate over them, and put the extra conditions into the loop, e.g.:
$rows = $xpath->query('//table[@class="space"]//tr');
foreach ($rows as $tr) {
$cells = $xpath->query('.//td', $tr);
if ($cells->length < 2) {
continue;
}
$number = $cells->item(0)->nodeValue;
$planet = trim($cells->item(1)->textContent);
printf("%d: %s\n", $number, $planet);
}
Here DOMXPath::query() is called with an XPath expression relative to the current row ($tr), and the loop then checks whether the returned DOMNodeList contains at least two cells. The rest of the code is trivial.
You can also use SimpleXML extension, which also supports XPath. But the extension is much less flexible as compared to the DOM extension.
For huge documents, use a streaming parser such as XMLReader, which reads the document node by node instead of loading it all into memory.
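A minimal sketch of the streaming approach with XMLReader follows. It assumes the input is well-formed XML (XMLReader is not an HTML parser, so tag-soup pages would need cleaning first), and the helper name streamCellText is made up for illustration.

```php
<?php
// Hypothetical helper: walks the document node by node and collects
// the text content of every <td> element it encounters.
function streamCellText(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $cells = [];
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'td') {
            // readString() returns the text of the current element,
            // including text nested in child elements such as <div>.
            $cells[] = trim($reader->readString());
        }
    }
    $reader->close();
    return $cells;
}

$xml = '<table><tr><td>1</td><td><div>Mars</div></td></tr>'
     . '<tr><td>2</td><td><div>Earth</div></td></tr></table>';
print_r(streamCellText($xml));
```

For a file on disk, $reader->open($path) would replace the XML() call, so the whole document never has to sit in memory at once.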

How to extract data from an HTML table using PHP

I keep trying different methods of extracting the data from the HTML table, such as XPath. The tables do not have any classes or IDs, so I am not sure how to target them with XPath. The data is retrieved from an RSS XML file. I am currently using DOM. After I extract the data, I will try to sort the tables by Job Title.
Here is my php code
$html='';
$xml= simplexml_load_file($url) or die("ERROR: Cannot connect to url\n check if report still exist in the Gradleaders system");
/*What we do here in this loop is retrieve all content inside the encoded content,
*which includes the CDATA information. This is where the HTML and styling is included.
*/
foreach($xml->channel->item as $cont){
$html=''.$cont->children('content',true)->encoded.'<br>'; //actual tag name is encoded
}
$htmlParser= new DOMDocument(); //to parse html using DOMDocument
libxml_use_internal_errors(true); // your HTML gives parser warnings, keep them internal
$htmlParser->loadHTML($html); //Loaded the html string we took from simple xml
$htmlParser->preserveWhiteSpace = false;
$tables= $htmlParser->getElementsByTagName('table');
$rows= $tables->item(0)->getElementsByTagName('tr');
foreach($rows as $row){
$cols = $row->getElementsByTagName('td');
echo $cols;
}
This is the HTML I am extracting info from
<table cellpadding='1' cellspacing='2'>
<tr>
<td><b>Job Title:</b></td>
<td>Job Example </td>
</tr>
<tr>
<td><b>Job ID:</b></td>
<td>23992</td>
</tr>
<tr>
<td><b>Job Description:</b></td>
<td>Just a job example </td>
</tr>
<tr>
<td><b>Job Category:</b></td>
<td>Work-study Position</td>
</tr>
<tr>
<td><b>Position Type:</b></td>
<td>Work-study</td>
</tr>
<tr>
<td><b>Applicant Type:</b></td>
<td>Work-study</td>
</tr>
<tr>
<td><b>Status:</b></td>
<td>Active</td>
</tr>
<tr>
<td colspan='2'><b><a href='https://www.myjobs.com/tuemp/job_view.aspx?token=I1iBwstbTs2pau+SjrYfWA%3d%3d'>Click to View More</a></b></td>
</tr>
</table>
You can use xpath to query('//td') and retrieve the td html using C14N(), something like:
$dom = new DOMDocument();
$dom->loadHtml($html);
$x = new DOMXpath($dom);
foreach($x->query('//td') as $td){
echo $td->C14N();
//if just need the text use:
//echo $td->textContent;
}
Output:
<td><b>Job Title:</b></td>
<td>Job Example </td>
<td><b>Job ID:</b></td>
...
C14N();
Returns canonicalized nodes as a string or FALSE on failure
Update:
Another question, how can I grab individual Table Data? For example,
just grab, Job ID
Use the XPath contains() function, e.g.:
foreach($x->query('//td[contains(., "Job ID:")]') as $td){
echo $td->textContent;
}
Update V2:
How can I get the next Table Data after that (to actually get the Job
Id) ?
Use following-sibling::*[1], e.g.:
echo $x->query('//td[contains(*, "Job ID:")]/following-sibling::*[1]')->item(0)->textContent;
//23992
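The label lookup and the following-sibling step can be wrapped in one small helper that also handles a missing label. This is only a sketch: the function name valueForLabel is hypothetical, the sample table is a shortened copy of the one in the question, and the label is spliced into the expression unescaped, so it is assumed to contain no double quotes.

```php
<?php
// Shortened copy of the table from the question.
$html = <<<'HTML'
<table cellpadding='1' cellspacing='2'>
<tr><td><b>Job Title:</b></td><td>Job Example </td></tr>
<tr><td><b>Job ID:</b></td><td>23992</td></tr>
<tr><td><b>Status:</b></td><td>Active</td></tr>
<tr><td colspan='2'><b>Click to View More</b></td></tr>
</table>
HTML;

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true); // keep parser warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$xpath = new DOMXPath($dom);

// Hypothetical helper: returns the text of the cell that follows the
// cell containing $label, or null when no such cell exists.
function valueForLabel(DOMXPath $xpath, string $label): ?string
{
    // Assumption: $label contains no double quotes.
    $query = '//td[contains(normalize-space(.), "' . $label . '")]'
           . '/following-sibling::td[1]';
    $node = $xpath->query($query)->item(0);
    return $node === null ? null : trim($node->textContent);
}

echo valueForLabel($xpath, 'Job ID:'), PHP_EOL;
```

Returning null instead of calling ->textContent on a missing node avoids a fatal error when a table lacks the requested row.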
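$xpathParser = new DOMXPath($htmlParser);
$tableDataNodes = $xpathParser->evaluate("//table/tr/td");
// DOMNodeList exposes its size as ->length and its entries via ->item()
for ($i = 0; $i < $tableDataNodes->length; $i++) {
// a DOMElement cannot be echoed directly, so print its text content
echo $tableDataNodes->item($i)->textContent, PHP_EOL;
}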

Inconsistent elements causing problems using simple html dom parser

I am scraping the following source using simple_html_dom.php:
http://www.forexfactory.com/calendar.php
I am scraping the table elements td.event and td.actual.
The problem is that, if you view the source, every td.event contains a span element, whose text I am extracting like so:
$events = array();
foreach ($html->find('td.event') as $event) {
foreach($event->find('span') as $e) {
$events[] = $e->innertext;
}
}
So
<td class="event"><span>Spanish Unemployment Change</span></td>
nicely gives me
Spanish Unemployment Change
However, the td.actual element is inconsistent, some contain span elements, some do not.
So the question is: given this inconsistency, how do I retrieve the text whether or not it is wrapped in a span?
E.g.
<td class="actual">46.9</td>
vs
<td class="actual"> <span class="better">54.0</span> </td>
<td class="actual"> <span class="worse">-64.4K</span> </td>
You can just use the plaintext property, as follows:
$actuals = array();
foreach ($html->find('td.actual') as $actual) {
$actuals[] = $actual->plaintext;
}
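For comparison, the same idea works with PHP's built-in DOM extension, where textContent plays the role of plaintext and returns the text whether or not it sits inside a span. This is a sketch using sample markup copied from the question; the function name actualValues is hypothetical.

```php
<?php
$html = <<<'HTML'
<table>
<tr><td class="actual">46.9</td></tr>
<tr><td class="actual"> <span class="better">54.0</span> </td></tr>
<tr><td class="actual"> <span class="worse">-64.4K</span> </td></tr>
</table>
HTML;

// Hypothetical helper: text of every td whose class list includes "actual".
function actualValues(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real pages rarely parse cleanly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Matches "actual" as a whole class token, even among several classes.
    $query = '//td[contains(concat(" ", normalize-space(@class), " "), " actual ")]';
    $values = [];
    foreach ($xpath->query($query) as $td) {
        $values[] = trim($td->textContent);
    }
    return $values;
}

print_r(actualValues($html));
```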

I want PHP code to find the href, title and some other info from an HTML table

This is the code I have so far:
<?php
$url=" SOME HTML URL ";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
echo $tag->getAttribute('href');
}
?>
I have HTML pages with tables, and I want the link, the title and the date. Example of the HTML:
<TR>
<TD align="center" vAlign="top" bgColor="#ffffff" class="smalltext">3</TD>
<TD class="plaintext" >THIS IS THE TITLE </TD>
<TD align="center" class="plaintext" >THIS IS DATE</TD>
</TR>
It works fine for the link, but I don't know how to get the others.
Thanks.
Where you are doing this:
$tags = $doc->getElementsByTagName('a');
You are getting back all the A tags. There only happens to be one.
If you want to get the text "THIS IS DATE", you aren't going to get it by looking in A tags, because the text is not inside an A tag; it is in a TD tag.
$tds = $doc->getElementsByTagName('td');
... would work to get all the TD elements, or you could assign an ID to the element you want to target and use getElementById instead.
Basically, though, this information is all in the documentation, which you absolutely should read before asking questions. Happy reading!
Once again, that's: http://php.net/manual/en/class.domdocument.php
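Putting the two lookups together, a row-by-row loop can pick up the link, title and date in one pass. The following is a sketch under assumptions: the posted snippet does not show where the link sits, so the <a> inside the title cell is a guess, and extractRows is a hypothetical name.

```php
<?php
// Sample row modeled on the question. The <a> in the title cell is an
// assumption; the posted HTML does not show the link's position.
$html = <<<'HTML'
<table>
<TR>
<TD align="center" vAlign="top" bgColor="#ffffff" class="smalltext">3</TD>
<TD class="plaintext"><a href="page.html">THIS IS THE TITLE</a></TD>
<TD align="center" class="plaintext">THIS IS DATE</TD>
</TR>
</table>
HTML;

// Hypothetical helper: one ['href', 'title', 'date'] entry per table row.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html); // tag names are lowercased by the HTML parser
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $tds = $tr->getElementsByTagName('td');
        if ($tds->length < 3) {
            continue; // skip rows that do not have the expected three cells
        }
        $link = $tds->item(1)->getElementsByTagName('a')->item(0);
        $rows[] = [
            'href'  => $link ? $link->getAttribute('href') : null,
            'title' => trim($tds->item(1)->textContent),
            'date'  => trim($tds->item(2)->textContent),
        ];
    }
    return $rows;
}

print_r(extractRows($html));
```

Using libxml_use_internal_errors() instead of the @ operator suppresses the same parser warnings while still leaving them retrievable through libxml_get_errors() if needed.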

PHP Using domdocument to extract data from html

I have a table with the following structure. I cannot seem to get the data I want.
<table class="gsborder" cellspacing="0" cellpadding="2" rules="cols" border="1" id="d00">
<tr class="gridItem">
<td>Code</td><td>0adf</td>
</tr><tr class="AltItem">
<td>CompanyName</td><td>Some Company</td>
</tr><tr class="Item">
<td>Owner</td><td>Jim Jim</td>
</tr><tr class="AltItem">
<td>DivisionName</td><td> </td>
</tr><tr class="Item">
<td>AddressLine1</td><td>9314 W. SPRING ST.</td>
</tr>
</table>
This table is of course nested within another table within the page. How can I use DOMDocument, for example, to refer to "Code" and "0adf" as a key-value pair? They don't actually need to be in a key-value pair, but I should be able to access each of them separately.
EDIT:
Using PHP Simple HTML, I was able to extract the data I needed using this:
$foo = $html->getElementById("d00")->childNodes(1)->childNodes(1);
The problem with this though is that I am getting the two <td></td> tags with my data. Is there a way to only grab the raw data without the tags?
Also, is this the right way to get my data out of this table?
If you're not dead set on using DOMDocument, try using the PHP Simple HTML DOM Parser. This has the benefit of allowing you to parse HTML which is not valid XML as well as providing a nicer interface to the parsed document.
You could write something like:
$html = str_get_html(...);
foreach($html->find('tr') as $tr)
{
print 'First td: ' . $tr->find('td', 0)->plaintext;
print 'Second td: ' . $tr->find('td', 1)->plaintext;
}
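If you would rather stay with DOMDocument, as the question asks, the key-value pairs can be built with DOMXPath. This is a minimal sketch, assuming every data row has at least two cells; tableById is a hypothetical name, the sample table is a shortened copy of the one in the question, and the id is interpolated into the expression unescaped.

```php
<?php
// Shortened copy of the table from the question.
$html = <<<'HTML'
<table class="gsborder" cellspacing="0" cellpadding="2" border="1" id="d00">
<tr class="gridItem"><td>Code</td><td>0adf</td></tr>
<tr class="AltItem"><td>CompanyName</td><td>Some Company</td></tr>
<tr class="Item"><td>Owner</td><td>Jim Jim</td></tr>
</table>
HTML;

// Hypothetical helper: maps first-cell text to second-cell text for
// every row of the table with the given id.
function tableById(string $html, string $id): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pairs = [];
    // Assumption: $id contains no double quotes.
    foreach ($xpath->query('//table[@id="' . $id . '"]//tr') as $tr) {
        $cells = $xpath->query('td', $tr);
        if ($cells->length >= 2) {
            // textContent yields the raw text without the <td> tags.
            $key = trim($cells->item(0)->textContent);
            $pairs[$key] = trim($cells->item(1)->textContent);
        }
    }
    return $pairs;
}

$data = tableById($html, 'd00');
echo $data['Code'], PHP_EOL;
echo $data['CompanyName'], PHP_EOL;
```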
