Xpath Not Returning Image Sources - php

I am new to web scraping and I am trying to scrape several URLs at once. I have created an array with all of the URLs and I am using a for loop to fetch each one.
$urls = [
"https://escapefromtarkov.gamepedia.com/Weapons",
"https://escapefromtarkov.gamepedia.com/Headwear",
"https://escapefromtarkov.gamepedia.com/Face_cover",
"https://escapefromtarkov.gamepedia.com/Eyewear",
"https://escapefromtarkov.gamepedia.com/Earpieces",
"https://escapefromtarkov.gamepedia.com/Chest_rigs",
"https://escapefromtarkov.gamepedia.com/Body_armor",
"https://escapefromtarkov.gamepedia.com/Backpacks",
"https://escapefromtarkov.gamepedia.com/Pouches",
"https://escapefromtarkov.gamepedia.com/Armbands",
"https://escapefromtarkov.gamepedia.com/Ammunition",
"https://escapefromtarkov.gamepedia.com/Weapon_mods",
"https://escapefromtarkov.gamepedia.com/Meds",
"https://escapefromtarkov.gamepedia.com/Consumables",
"https://escapefromtarkov.gamepedia.com/Loot",
"https://escapefromtarkov.gamepedia.com/Keys_%26_Intel",
"https://escapefromtarkov.gamepedia.com/Containers"
];
for($i = 0; $i < count($urls); $i++)
{
$html = file_get_contents($urls[$i]);
$wiki_doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$wiki_doc->loadHTML($html);
libxml_clear_errors();
$wiki_xpath = new DOMXPath($wiki_doc);
$wiki_row = $wiki_xpath->query('//table[@class="wikitable"]/tbody/tr/td/a/img/@src');
foreach($wiki_row as $row)
{
$row->nodeValue;
}
}
I am looking to get the src of each image within tables with a class of 'wikitable'. However, when I run this I get no results.

The tbody element is added by the browser. The developer tools DOM view shows a cleaned-up, repaired, unified HTML DOM of the page. Look at the actual source.
<table class="wikitable sortable">
<tr>
<th>Name
</th>
<th>Image
</th>
<th>Cartridge
</th>
<th>Description
</th></tr>
<tr>
<td>AK-101
</td>
<td><a href="/AK-101" title="AK-101"><img alt="AK101 Image.png" src="https://d1u5p3...
There is no tbody here, and the class attribute does not contain just wikitable. That can be matched in XPath 1.0, but it needs a little string magic:
//table[contains(concat(' ', normalize-space(@class), ' '), ' wikitable ')]/tr/td/a/img/@src
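A minimal sketch of that expression in use, run against a small inline sample rather than the live wiki page. The sample HTML, the example.com image URLs and the helper name wikitable_image_sources are assumptions made for illustration only.

```php
<?php
// Hypothetical helper: returns the src of every image inside a table whose
// class list contains the token "wikitable".
function wikitable_image_sources(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = '//table[contains(concat(" ", normalize-space(@class), " "), " wikitable ")]'
           . '/tr/td/a/img/@src';

    $sources = [];
    foreach ($xpath->query($query) as $attr) {
        $sources[] = $attr->nodeValue;  // value of the src attribute node
    }
    return $sources;
}

// Made-up sample: no <tbody>, and more than one class on the first table.
$sample = '<table class="wikitable sortable">
<tr><th>Name</th><th>Image</th></tr>
<tr><td>AK-101</td><td><a href="/AK-101"><img alt="AK-101" src="https://example.com/ak101.png"></a></td></tr>
</table>
<table class="wikitable-other">
<tr><td><a href="/x"><img alt="x" src="https://example.com/skip.png"></a></td></tr>
</table>';

print_r(wikitable_image_sources($sample));
```

Only the image from the first table should be returned, since the second table's class merely starts with the same letters.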

There are a couple of problems with the XPath. The first is that @class="wikitable" is an exact match, so it won't work if the element has other classes as well. You should instead check whether the class attribute contains the class you're after. The second is that there isn't a <tbody> element in the original document. So the XPath line should be
$wiki_row = $wiki_xpath->query('//table[contains(@class,"wikitable")]/tr/td/a/img/@src');
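One thing to keep in mind is that contains(@class, "wikitable") is a substring test, so it would also match a class such as wikitable-legend. The sketch below compares it with a whole-token test on made-up markup; the class name wikitable-legend is purely hypothetical.

```php
<?php
// Two made-up tables: one really has the "wikitable" class, the other only
// has a class name that starts with the same characters.
$html = '<table class="wikitable sortable"><tr><td>a</td></tr></table>
<table class="wikitable-legend"><tr><td>b</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Substring test: true for any class attribute that contains the text.
$substring = $xpath->query('//table[contains(@class, "wikitable")]');

// Token test: pads the attribute with spaces and looks for the whole word.
$token = $xpath->query(
    '//table[contains(concat(" ", normalize-space(@class), " "), " wikitable ")]'
);

echo $substring->length, PHP_EOL; // counts both tables
echo $token->length, PHP_EOL;     // counts only the first table
```

If the pages being scraped never use look-alike class names, the shorter substring form is fine.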

Related

How to get child element of DOMDocument

Trying to get the specific value from a table with tr and td elements...
HTML:
<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>
PHP:
$html = 'http://www.example.com'; // edited
$dom = new DOMDocument;
@$dom->loadHTML($html);
$data = $dom->getElementsByTagName('tr:nth-child(3n)');
foreach ($data as $datas){
echo $link->nodeValue;
}
Using such or different approach, how to get the value of specific td element... ?
getElementsByTagName() returns a list of the matching tags below your starting point, so once you've found the table, you can use the same function on it to get the <td> tags. You can then just pick out the elements you're after...
$data = "<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>";
$dom = new DOMDocument;
$dom->loadHTML($data);
$table = $dom->getElementsByTagName('table');
$td = $table[0]->getElementsByTagName('td'); // Fetch all td elements in the first table
echo $td[2]->nodeValue; // Echo out the value of the 3rd item (zero based arrays)
Prints out..
value3
XPath can be used to get a particular element. Try the following code to get the third td value from the given HTML.
$html = '<table>
<tr>
<td>value1</td>
<td>value2</td>
<td>value3</td>
</tr>
</table>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$table = $dom->getElementsByTagName('table')->item(0);
$query = 'tr/td[3]';
$entries = $xpath->query($query, $table);
echo $entries[0]->nodeValue;
Read about DOMXpath query()
Update: Using file_get_contents is also simple. You can retrieve the HTML/XML as a string in the $html variable, and the rest of the process is the same:
$html = file_get_contents("path/to/file/x.html"); // target path
I think this should help you.
Just assign a class name to the td elements and then use this (jQuery),
for example:
<td class="two">value2</td>
$(this).closest('tr').children('td.two').text();

PHP DOM/xpath check element span class value

Within a curl request I have a html table that has the below structure. I now want to extract only table rows that contain a span element with the empty class and not the ones with the class="subcomponent".
I successfully used XPath to find the elements with the empty class, but how do I get the entire <tr>, or even better the specific <td> nodes that contain Version and Partnumber?
Thanks in advance.
<table>
...
<tbody>
<tr>
<td></td>
<td></td>
<td>
<span class="">Product</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
<tr>
<td></td>
<td></td>
<td>
<span class="subcomponent">Component</span>
</td>
<td>Version</td>
<td>Partnumber</td>
</tr>
</tbody>
My PHP code
$doc = new DOMdocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$doc->saveHTML();
$xpath = new DOMXpath($doc);
$query ='//span[@class=""]';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo $entry->C14N();
}
To access the table rows themselves using SimpleXML, you can use the following:
$sxml = simplexml_load_string('<table>...</table>');
$rows = $sxml->xpath('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
echo "Version: ", $row->td[3], ", Partnumber: ", $row->td[4];
}
The XPath works by selecting all <tr> tags that have a child <td>, which itself has a child <span> with a blank class.
In the loop, you need to access the child cells of each row by number, since your sample doesn't indicate that they're labelled any other way. I'm assuming a table structure won't change too often though, so that should be fine.
See https://eval.in/860169 for an example.
Alternative DOMDocument Version
If you're fetching a full webpage, which won't necessarily be well-formed, you might need to use DOMDocument as you have in your first example. It's a bit less clean to access the child-elements, but something like the following will work:
$doc = new DOMdocument;
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$xpath = new DOMXpath($doc);
$rows = $xpath->query('//tr[td/span[@class=""]]');
foreach ($rows as $row) {
$cells = $row->getElementsByTagName('td');
$version = $cells->item(3)->nodeValue;
$partNumber = $cells->item(4)->nodeValue;
echo "Version: {$version}, Part Number: {$partNumber}", PHP_EOL;
}
See https://eval.in/860217
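As a variation on the DOMDocument version, the cells can also be read with relative XPath expressions evaluated against each matching row. This is only a sketch: the sample markup is modelled on the question, and the cell values (1.2, PN-100 and so on) are made up.

```php
<?php
// Sample markup modelled on the question; the cell values are invented.
$page = '<table><tbody>
<tr><td></td><td></td><td><span class="">Product</span></td><td>1.2</td><td>PN-100</td></tr>
<tr><td></td><td></td><td><span class="subcomponent">Component</span></td><td>0.9</td><td>PN-200</td></tr>
</tbody></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$products = [];
foreach ($xpath->query('//tr[td/span[@class=""]]') as $row) {
    // The second argument is the context node, so td[4] and td[5] are
    // resolved relative to the current row (XPath positions start at 1).
    $products[] = [
        'version'    => $xpath->evaluate('string(td[4])', $row),
        'partnumber' => $xpath->evaluate('string(td[5])', $row),
    ];
}

print_r($products);
```

Only the row whose span has an empty class should end up in $products.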
I would use the following XPath expression:
//td[text()="Version"] | //td[text()="Partnumber"]
Which gives me:
Element='<td>Version</td>'
Element='<td>Partnumber</td>'
Element='<td>Version</td>'
Element='<td>Partnumber</td>'

php domdocument or domxpath: how to extract TRs and save html

I have been struggling with this all day.
I have an html table in a string.
<TABLE>
<TBODY>
<TR CLASS=dna1>
<TD></TD><TD></TD><TD></TD><TD></TD>
</TR>
<TR CLASS=dna2>
<TD></TD><TD></TD><TD></TD><TD></TD>
</TR>
repeat...
Inside the <TD> are some <DIV> and <SPAN> that I need to work with.
I need to extract each <TR> (both classes) and save the html in an array where each <TR> is an array element.
Creating a node list array is easy enough, but how do I get the actual html?
If you must save the HTML as a string, there is DOMDocument::saveHTML
$elems = $xpath->query('//tr');
foreach ($elems as $elem) {
$array[] = $doc->saveHTML($elem);
}
(Note that the parameter for saveHTML is available as of PHP 5.3.6.)
I'd recommend saving the nodes themselves, though, and converting them to string only shortly before you output them.
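A rough sketch of that approach, assuming PHP 7.4 or later for the arrow function: keep the DOMElement objects in an array, work on them as nodes, and call saveHTML() only when strings are needed. The sample table contents are made up.

```php
<?php
// Made-up sample in the same style as the question's markup.
$html = '<TABLE><TBODY>
<TR CLASS=dna1><TD><DIV>a</DIV></TD><TD><SPAN>b</SPAN></TD></TR>
<TR CLASS=dna2><TD><DIV>c</DIV></TD><TD><SPAN>d</SPAN></TD></TR>
</TBODY></TABLE>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);   // the parser lower-cases tag and attribute names
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Keep the DOMElement objects so they can still be queried or modified.
$rows = iterator_to_array($xpath->query('//tr[@class="dna1" or @class="dna2"]'));

// Work on the nodes first, e.g. read the <span> text inside each row...
$spans = [];
foreach ($rows as $row) {
    $spans[] = $xpath->evaluate('string(.//span)', $row);
}

// ...and serialise to HTML strings only when output is actually needed.
$rowHtml = array_map(fn (DOMElement $row) => $doc->saveHTML($row), $rows);

print_r($spans);
print_r($rowHtml);
```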
Alternatively using DOMDocument only:
$dom = new DOMDocument();
@$dom->loadHTML($html);
if($table=$dom->getElementsByTagName('table')->item(0)){
//traverse the table and output every rows
$rows=array();
foreach ($table->childNodes as $row){
$rows[]=$dom->saveHTML($row);
}
}

Not getting the Tag's nodeValue using DOMDocument class in PHP?

I am using the DOMDocument class to parse an HTML document in PHP.
The code of the table I am using:
<table>
<tr>
<td> 123 employees </td>
</tr>
<tr>
<td> $50,000 </td>
</tr>
</table>
I am not able to fetch the nodeValue of tags in the above format,
i.e. ($50,000, 123 employees).
Maybe you could use XPath queries. For example:
$doc = new DOMDocument();
$doc->loadHTML($html); // $html holds your HTML string
$xpath = new DOMXPath($doc);
$query = '//table/tr/td';
$entries = $xpath->query($query);
foreach ($entries as $entry)
print_r($entry->nodeValue); // (*)
Just replace the line marked with (*) with code that suits your needs.
Cheers.
It's hard to say what you are doing wrong if you don't say what you are doing at all.
But in principle you must get a reference to the table (as a DOM element), then get the rows and cells (child nodes) and get the node value from there.
If your table is the only one in the document, you'll get it via $doc->getElementsByTagName('table')->item(0) (with $doc being the DOM document object).
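The steps described above could look roughly like this, using the table from the question. The trim() call is an addition of this sketch, since the cell text in the sample is padded with spaces.

```php
<?php
// Single quotes keep "$50,000" literal.
$html = '<table>
<tr><td> 123 employees </td></tr>
<tr><td> $50,000 </td></tr>
</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$values = [];
// First (and here only) table in the document; item() returns null if absent.
$table = $doc->getElementsByTagName('table')->item(0);
if ($table !== null) {
    foreach ($table->getElementsByTagName('tr') as $row) {
        foreach ($row->getElementsByTagName('td') as $cell) {
            $values[] = trim($cell->nodeValue); // strip the padding whitespace
        }
    }
}

print_r($values);
```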

simplehtmldom php: How do you search for one thing or another

I want to scrape some HTML with simple html dom in PHP. I have a bunch of <tr> tags containing other tags. The <tr> tags I want alternate between bgcolor=#ffffff and bgcolor=#cccccc. There are some <tr> tags that have other bg colors.
I want to get all the code in each <tr> tag that has either bgcolor=#ffffff or bgcolor=#cccccc. I can't just use $html->find('tr') as there are other <tr> tags that I don't want to find.
Any help would be appreciated.
You can use simplehtmldom too.
This is my solution for your problem:
<?php
include_once "simple_html_dom.php";
// the html code example
$html = '<table>
<tr bgcolor="#ffffff"><td>1</td></tr>
<tr bgcolor="#cccccc"><td>2</td></tr>
<tr bgcolor="#ffffff"><td>3</td></tr>
</table>';
// in this case I load the html code via string
$code = str_get_html($html);
// find elem by attribute
$trs = $code -> find('tr[bgcolor=#ffffff]');
foreach($trs as $tr){
echo $tr -> innertext;
}
$trs = $code -> find('tr[bgcolor=#cccccc]');
foreach($trs as $tr){
echo $tr -> innertext;
}
?>
You could load the DOM into a simplexml class and then use xpath, like so:
$xml = simplexml_import_dom($simple_html_dom);
$goodies = $xml -> xpath('//*[@bgcolor = "#ffffff"] | //*[@bgcolor = "#cccccc"]');
You might even be able to put that OR syntax within the same set of brackets, but I'd need to double check.
Update:
Sorry, I thought you were talking about the DOM extension. I just looked up simplehtmldom, and it appears that its find feature is loosely based on XPath. Why not just do:
$goodies = $html -> find('[bgcolor=#ffffff], [bgcolor=#cccccc]');
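For comparison, XPath 1.0 does allow both conditions inside one predicate with the or operator. The sketch below uses PHP's built-in DOM extension rather than simplehtmldom, on made-up rows where the third one has a different colour.

```php
<?php
// Made-up rows: the third one has a colour that should be skipped.
$html = '<table>
<tr bgcolor="#ffffff"><td>1</td></tr>
<tr bgcolor="#cccccc"><td>2</td></tr>
<tr bgcolor="#ff0000"><td>3</td></tr>
</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Both conditions share a single predicate, joined with "or".
$rows = $xpath->query('//tr[@bgcolor = "#ffffff" or @bgcolor = "#cccccc"]');

$texts = [];
foreach ($rows as $row) {
    $texts[] = trim($row->textContent);
}

print_r($texts);
```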
