PHP DOMDocument or DOMXPath: how to extract TRs and save HTML

I have been struggling with this all day.
I have an html table in a string.
<TABLE>
<TBODY>
<TR CLASS=dna1>
<TD></TD><TD></TD><TD></TD><TD></TD>
</TR>
<TR CLASS=dna2>
<TD></TD><TD></TD><TD></TD><TD></TD>
</TR>
repeat...
Inside the <TD> are some <DIV> and <SPAN> that I need to work with.
I need to extract each <TR> (both classes) and save the html in an array where each <TR> is an array element.
Creating a node list array is easy enough, but how do I get the actual html?

If you must save the HTML as a string, there is DOMDocument::saveHTML:
$array = array();
$elems = $xpath->query('//tr');
foreach ($elems as $elem) {
    $array[] = $doc->saveHTML($elem);
}
(Note that the parameter for saveHTML is available as of PHP 5.3.6.)
I'd recommend saving the nodes themselves, though, and converting them to string only shortly before you output them.
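A minimal sketch of that recommendation, assuming a small made-up table (the sample markup and the data-seen attribute are only for illustration): the rows are kept as DOM nodes while they are being worked on and are turned into strings at the very end.

```php
<?php
// Hypothetical sample input; the real table from the question is not shown in full.
$html = '<table><tbody>'
      . '<tr class="dna1"><td><div>a</div></td><td><span>b</span></td></tr>'
      . '<tr class="dna2"><td><div>c</div></td><td><span>d</span></td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Keep the DOMElement objects themselves so they can still be queried or modified.
$rows = iterator_to_array($xpath->query('//tr'));

// Example of work that needs nodes rather than strings: mark every <span> in each row.
foreach ($rows as $row) {
    foreach ($xpath->query('.//span', $row) as $span) {
        $span->setAttribute('data-seen', '1');
    }
}

// Serialize only when the HTML is actually needed.
$rowHtml = array_map(function (DOMElement $row) use ($doc) {
    return $doc->saveHTML($row);
}, $rows);

print_r($rowHtml);
```

Because the array holds nodes until the last step, any later change to a row is reflected in the final HTML without re-parsing strings.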

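Alternatively, using DOMDocument only:
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed markup
$rows = array();
if ($table = $dom->getElementsByTagName('table')->item(0)) {
    // Collect the HTML of every row. Iterating $table->childNodes would
    // yield the <TBODY> wrapper and whitespace text nodes instead of rows.
    foreach ($table->getElementsByTagName('tr') as $row) {
        $rows[] = $dom->saveHTML($row);
    }
}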


Xpath Not Returning Image Sources

I am new to web scraping and I am trying to scrape a few URLs at once. I have created an array with all of the URLs and I am using a for loop to fetch each one.
$urls = [
"https://escapefromtarkov.gamepedia.com/Weapons",
"https://escapefromtarkov.gamepedia.com/Headwear",
"https://escapefromtarkov.gamepedia.com/Face_cover",
"https://escapefromtarkov.gamepedia.com/Eyewear",
"https://escapefromtarkov.gamepedia.com/Earpieces",
"https://escapefromtarkov.gamepedia.com/Chest_rigs",
"https://escapefromtarkov.gamepedia.com/Body_armor",
"https://escapefromtarkov.gamepedia.com/Backpacks",
"https://escapefromtarkov.gamepedia.com/Pouches",
"https://escapefromtarkov.gamepedia.com/Armbands",
"https://escapefromtarkov.gamepedia.com/Ammunition",
"https://escapefromtarkov.gamepedia.com/Weapon_mods",
"https://escapefromtarkov.gamepedia.com/Meds",
"https://escapefromtarkov.gamepedia.com/Consumables",
"https://escapefromtarkov.gamepedia.com/Loot",
"https://escapefromtarkov.gamepedia.com/Keys_%26_Intel",
"https://escapefromtarkov.gamepedia.com/Containers"
];
for($i = 0; $i < count($urls); $i++)
{
$html = file_get_contents($urls[$i]);
$wiki_doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$wiki_doc->loadHTML($html);
libxml_clear_errors();
$wiki_xpath = new DOMXPath($wiki_doc);
$wiki_row = $wiki_xpath->query('//table[@class="wikitable"]/tbody/tr/td/a/img/@src');
foreach($wiki_row as $row)
{
$row->nodeValue;
}
}
I am looking to get the src of each image within tables that have a class of 'wikitable'. However, when I run this I get no results.
The tbody element is added by the browser. The developer tools DOM view shows a cleaned-up, repaired, unified HTML DOM of the page. Look at the actual source:
<table class="wikitable sortable">
<tr>
<th>Name
</th>
<th>Image
</th>
<th>Cartridge
</th>
<th>Description
</th></tr>
<tr>
<td>AK-101
</td>
<td><a href="/AK-101" title="AK-101"><img alt="AK101 Image.png" src="https://d1u5p3...
There is no tbody here, and the class attribute does not contain just wikitable. That can be matched in XPath 1.0, but it needs a little string magic:
//table[contains(concat(' ', normalize-space(@class), ' '), ' wikitable ')]/tr/td/a/img/@src
There are a couple of problems with the XPath. The first is that @class="wikitable" only matches when wikitable is the element's only class, so it fails if other classes are present. You should instead test whether the class attribute contains the class you're after. The second is that there isn't a <tbody> element in the original document. So the XPath line should be:
$wiki_row = $wiki_xpath->query('//table[contains(@class,"wikitable")]/tr/td/a/img/@src');
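The two class tests suggested in these answers behave differently when another class name merely contains the string wikitable. The sketch below compares them on made-up markup (the URLs and the notwikitable class are invented for illustration); it is not the wiki's real page.

```php
<?php
// Hypothetical markup modelled on the quoted page source: no <tbody>, several classes.
$html = '<table class="wikitable sortable">'
      . '<tr><th>Name</th><th>Image</th></tr>'
      . '<tr><td>AK-101</td><td><a href="/AK-101"><img alt="AK-101" src="https://example.com/ak101.png"></a></td></tr>'
      . '</table>'
      . '<table class="notwikitable">'
      . '<tr><td><a href="/x"><img alt="x" src="https://example.com/other.png"></a></td></tr>'
      . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Substring test: also matches class="notwikitable".
$loose = '//table[contains(@class, "wikitable")]/tr/td/a/img/@src';

// Whole-token test: pads the class list with spaces and looks for " wikitable ".
$strict = '//table[contains(concat(" ", normalize-space(@class), " "), " wikitable ")]/tr/td/a/img/@src';

// Helper (name chosen for this sketch) that collects attribute values into an array.
function collectValues(DOMXPath $xpath, string $query): array
{
    $values = [];
    foreach ($xpath->query($query) as $attr) {
        $values[] = $attr->nodeValue;
    }
    return $values;
}

$looseSrcs  = collectValues($xpath, $loose);
$strictSrcs = collectValues($xpath, $strict);

print_r($looseSrcs);
print_r($strictSrcs);
```

If the page only ever uses wikitable as a whole class name, the shorter contains(@class, ...) form is enough; the padded form is the safer choice when class names may overlap.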

How to get child element of DOMDocument

Trying to get the specific value from a table with tr and td elements...
HTML:
<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>
PHP:
$html = 'http://www.example.com'; // edited
$dom = new DOMDocument;
@$dom->loadHTML($html);
$data = $dom->getElementsByTagName('tr:nth-child(3n)');
foreach ($data as $datas){
echo $link->nodeValue;
}
Using such or different approach, how to get the value of specific td element... ?
getElementsByTagName() returns a list of the matching tags below your starting point, so once you've found the table, you can use the same function on it to get the <td> tags. You can then just pick out the elements you're after:
$data = "<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>";
$dom = new DOMDocument;
$dom->loadHTML($data);
$table = $dom->getElementsByTagName('table');
$td = $table[0]->getElementsByTagName('td'); // Fetch all td elements in the first table
echo $td[2]->nodeValue; // Echo out the value of the 3rd item (zero based arrays)
Prints out..
value3
XPath can be used to get a particular element. Try the following code to get the value of the 3rd td from the given HTML.
$html = '<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$table = $dom->getElementsByTagName('table')->item(0);
$query = 'tr/td[3]';
$entries = $xpath->query($query, $table);
echo $entries[0]->nodeValue;
Read about DOMXpath query()
Update: Using file_get_contents is also simple. You can retrieve the HTML/XML as a string in the $html variable, and the rest of the process is the same:
$html = file_get_contents("path/to/file/x.html"); // target path
I think this should help you.
Just assign a class name to the td elements and then use this (jQuery, so it runs in the browser rather than in PHP),
e.g.:
<td class="two">value2</td>
$(this).closest('tr').children('td.two').text();

PHP DOM/xpath check element span class value

Within a curl request I have an HTML table that has the structure below. I now want to extract only the table rows that contain a span element with an empty class, and not the ones with class="subcomponent".
I successfully used XPath to find the elements with the empty class, but how do I get the entire <tr>, or even better the specific <td> nodes that contain Version and Partnumber?
Thanks in advance.
<table>
...
<tbody>
<tr>
<td></td>
<td></td>
<td>
<span class="">Product</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
<tr>
<td></td>
<td></td>
<td>
<span class="subcomponent">Component</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
</tbody>
My PHP code
$doc = new DOMdocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$doc->saveHTML();
$xpath = new DOMXpath($doc);
$query ='//span[@class=""]';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo $entry->C14N();
}
To access the table rows themselves using SimpleXML, you can use the following:
$sxml = simplexml_load_string('<table>...</table>');
$rows = $sxml->xpath('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
echo "Version: ", $row->td[3], ", Partnumber: ", $row->td[4];
}
The XPath works by selecting all <tr> tags that have a child <td>, which itself has a child <span> with a blank class.
In the loop, you need to access the child cells of each row by number, since your sample doesn't indicate that they're labelled any other way. I'm assuming a table structure won't change too often though, so that should be fine.
See https://eval.in/860169 for an example.
Alternative DOMDocument Version
If you're fetching a full webpage, which won't necessarily be well-formed, you might need to use DOMDocument as you have in your first example. It's a bit less clean to access the child-elements, but something like the following will work:
$doc = new DOMdocument;
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$xpath = new DOMXpath($doc);
$rows = $xpath->query('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
$version = $cells->item(3)->nodeValue;
$partNumber = $cells->item(4)->nodeValue;
echo "Version: {$version}, Part Number: {$partNumber}", PHP_EOL;
}
See https://eval.in/860217
I would use the following XPath expression:
//td[text()="Version"] | //td[text()="Partnumber"]
Which gives me:
Element='<td>Version</td>'
Element='<td>Partnumber</td>'
Element='<td>Version</td>'
Element='<td>Partnumber</td>'
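Note that this union returns the cells of every row, including the subcomponent row. A possible way to limit it to rows whose span has an empty class is sketched below. It assumes, as in the question's sample, that Version and Partnumber are always the 4th and 5th cells; the cell values are made up so the two rows can be told apart.

```php
<?php
// Hypothetical data following the structure from the question.
$page = '<table><tbody>'
      . '<tr><td></td><td></td><td><span class="">Product</span></td><td>1.0</td><td>P-100</td></tr>'
      . '<tr><td></td><td></td><td><span class="subcomponent">Component</span></td><td>2.0</td><td>P-200</td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Rows that contain <span class="">, then only their 4th and 5th <td> children.
$query = '//tr[td/span[@class=""]]/td[position() = 4 or position() = 5]';

$cells = [];
foreach ($xpath->query($query) as $td) {
    $cells[] = trim($td->nodeValue);
}

print_r($cells);
```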

preg_match() find all values inside of table?

hey guys,
a curl function returns a string $widget that contains regular html -> two divs where the first div holds a table with various values inside of <td>'s.
i wonder what's the easiest and best way for me to extract only all the values inside of the <td>'s so i have blank values without the remaining html.
any idea what the pattern for the preg_match should look like?
thank you.
Regex is not a suitable solution. You're better off loading it up in a DOMDocument and parsing it.
You're better off using a DOM parser for that task:
$html = <<<HTML
<div>
<table>
<tr>
<td>foo</td>
<td>bar</td>
</tr>
<tr>
<td>hello</td>
<td>world</td>
</tr>
</table>
</div>
<div>
Something irrelevant
</div>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tds = $xpath->query('//div/table/tr/td');
foreach ($tds as $cell) {
echo "{$cell->textContent}\n";
}
Would output:
foo
bar
hello
world
You shouldn't use regexps to parse HTML. Use DOM and XPath instead. Here's an example:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td');
$result = array();
foreach ($nodes as $node) {
$result[] = $node->nodeValue;
}
// $result holds the values of the tds
Only if you have very limited, well-defined HTML can you expect to parse it with regular expressions. The highest ranked SO answer of all time addresses this issue.
He comes ...

Not getting the Tag's nodeValue using DOMDocument class in PHP?

i am using DOMDocument class, to parse HTML document in PHP.
the code of table i am using...
<table>
<tr>
<td> 123 employees </td>
</tr>
<tr>
<td> $50,000 </td>
</tr>
</table>
i am not able to fetch nodeValue of the tag, which are like in the above format,
i.e ($50,000, 123 employees ).
Maybe you could use XPath queries. For example:
$doc = new DOMDocument();
$doc->loadHTML($html); // $html holds your HTML string
$xpath = new DOMXPath($doc);
$query = '//table/tr/td';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
    print_r($entry->nodeValue); // (*)
}
Just replace the line marked with (*) with code that suits your needs.
Cheers.
It's hard to say what you are doing wrong if you don't say what you are doing at all.
But in principle you must get a reference to the table (as a DOM element), then get the rows and cells (child nodes) and get the node value from there.
If your table is the only one in the document, you'll get it via $doc->getElementsByTagName('table')->item(0) (with $doc being the DOM document object).
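A rough sketch of that table, rows, cells walk, assuming the table from the question is the only one in the document. The HTML is put in a single-quoted string here so that the dollar sign is taken literally; trim() is added only to drop the padding spaces around each value.

```php
<?php
// Sample markup copied from the question.
$html = '<table>'
      . '<tr><td> 123 employees </td></tr>'
      . '<tr><td> $50,000 </td></tr>'
      . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$values = [];

// item(0) returns null when the document has no <table>, so check before using it.
$table = $doc->getElementsByTagName('table')->item(0);
if ($table !== null) {
    foreach ($table->getElementsByTagName('tr') as $row) {
        foreach ($row->getElementsByTagName('td') as $cell) {
            $values[] = trim($cell->nodeValue);
        }
    }
}

print_r($values);
```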
