Need help scraping webpage -- getting specific content... - php

I have a table whose number of columns can change depending on the configuration of the scraped page (I have no control over it). I want to get only the information from a specific column, designated by the column's heading.
Here is a simplified table:
<table>
<tbody>
<tr class='header'>
<td>Image</td>
<td>Name</td>
<td>Time</td>
</tr>
<tr>
<td><img src='someimage.png' /></td>
<td>Name 1</td>
<td>13:02</td>
</tr>
<tr>
<td><img src='someimage.png' /></td>
<td>Name 2</td>
<td>13:43</td>
</tr>
<tr>
<td><img src='someimage.png' /></td>
<td>Name 3</td>
<td>14:53</td>
</tr>
</tbody>
</table>
I want to extract only the names (column 2) of the table. However, as previously stated, the column order cannot be known in advance. The Image column might not be there, for example, in which case the column I want would be the first one.
I was wondering if there's any way to do this with DomDocument/DomXPath. Perhaps search for the string "Name" in the first tr, find out which column index it is, and then use that to get the info. A less elegant solution would be to check whether the first column has an img tag, in which case the image column is first, so we can throw that away and use the next one.
I've been looking at it for about an hour and a half, but I'm not familiar with DomDocument functions and manipulation. Having a lot of trouble with this one.

Simple HTML DOM Parser may be useful here; check its manual. Basically, you could use something like:
$url = "file url";
$html = file_get_html($url);
$header = $html->find('tr.header td');
$i = 0;
foreach ($header as $element) {
    // Simple HTML DOM exposes the cell content as lowercase innertext
    if (trim($element->innertext) == 'Image') { $num = $i; }
    $i++;
}
This finds which column ($num) is the Image column; the same comparison against 'Name' gives the index of the Name column. You can add further code to build on this.
PS: An easy way to find all image sources:
$images = $html->find('tr td img');
foreach ($images as $image){
$imageUrl[] = $image->src;
}
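Since the question asks specifically about DomDocument/DomXPath, here is a minimal sketch of the "find the heading's index, then read that column" idea using only PHP's bundled DOM extension. The function name extractColumnByHeading and the inline sample markup are illustrative assumptions, and the sketch assumes the header row carries class='header' as in the simplified table.

```php
<?php
// Sample markup mirroring the simplified table from the question.
$html = <<<'HTML'
<table>
<tbody>
<tr class='header'><td>Image</td><td>Name</td><td>Time</td></tr>
<tr><td><img src='someimage.png' /></td><td>Name 1</td><td>13:02</td></tr>
<tr><td><img src='someimage.png' /></td><td>Name 2</td><td>13:43</td></tr>
<tr><td><img src='someimage.png' /></td><td>Name 3</td><td>14:53</td></tr>
</tbody>
</table>
HTML;

// Hypothetical helper: returns the text of every cell in the column
// whose header cell matches $heading, or an empty array if not found.
function extractColumnByHeading(string $html, string $heading): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Step 1: locate the 1-based position of the wanted heading.
    $index = null;
    $position = 0;
    foreach ($xpath->query("//tr[@class='header']/td") as $cell) {
        $position++;
        if (trim($cell->textContent) === $heading) {
            $index = $position;
            break;
        }
    }
    if ($index === null) {
        return [];
    }

    // Step 2: take the td at that position from every non-header row.
    $query = sprintf("//tr[not(@class='header')]/td[%d]", $index);
    $values = [];
    foreach ($xpath->query($query) as $cell) {
        $values[] = trim($cell->textContent);
    }
    return $values;
}

$names = extractColumnByHeading($html, 'Name');
print_r($names);
```

Because the index is looked up on every run, the same call keeps working if the Image column disappears and Name becomes the first column.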

Related

How should I change this scraping PHP script to work with this table?

I'm using this code to scrape some info from a table on a website. One example I have works, because it has a row of th, followed by tr, td (the th is the first row above the other rows horizontally).
$dom = new \simple_html_dom($html);
$rows = $dom->find('table.table-bordered tbody tr');
$header = [];
foreach ($rows as $row) {
if(!empty($header)) break;
foreach ($row->find('th') as $key=>$th) {
$header[] = trim(html_entity_decode($th->plaintext));
}
}
$cells = [];
foreach ($rows as $row) {
$cell = [];
foreach ($row->find('td') as $key=>$td) {
$cell[$header[$key]] = trim(html_entity_decode($td->plaintext));
}
if(!empty($cell)) {
$cells[] = $cell;
}
}
The problem is that another example table I have has a different structure and I'm unsure how to change the code to reflect it. The th is on each row vertically as the first column of the table. Thus the first th gets repeated in the output as the key for all rows.
<table class="table table-bordered">
<tbody>
<tr>
<th> Sender </th>
<td> Test </td>
</tr>
<tr>
<th> Number </th>
<td> 1234 </td>
</tr>
<tr>
</tbody>
</table>
There is also a second table with no class nor id, which I would like to get separately. Is there a way to skip the first table?
<table class="table">
<tbody>
<tr>
<th> Table 2 cell 1 </th>
<td> Test table 2 </td>
</tr>
<tr>
<th> Number something </th>
<td> 1234 table 2 </td>
</tr>
<tr>
</tbody>
</table>
The output looks like this (json encoded):
[{"Sender":"Test"},{"Sender":"1234"},{"Sender":"Test table 2"},{"Sender":"1234 table 2"}]
Should be:
[{"Sender":"Test"},{"Number":"1234"},{"Table 2 cell 1":"Test table 2"},{"Number something":"1234 table 2"}]
Or ignoring the first table table table-bordered:
[{"Table 2 cell 1":"Test table 2"},{"Number something":"1234 table 2"}]
Sender should not be the key for each row. What should be changed in the PHP code to read this table correctly? I don't think the $dom->find is actually finding single rows and then looking for th and td inside.
I think the following selector will let you scrape only the second table. Written as [class='table'], the attribute selector matches only elements whose class attribute is exactly table, so compound values such as table table-bordered are ignored.
Replace the existing $rows line in your script with this one:
$rows = $dom->find("[class='table'] tbody tr");
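The selector change picks the table, but the repeated "Sender" key comes from collecting headers once and reusing them by index. For a table where each row holds its own th/td pair, the key has to be read per row. Below is a minimal sketch of that idea; note that it swaps in PHP's built-in DOMDocument/DOMXPath instead of Simple HTML DOM so that it runs without extra files, and the function name readKeyValueTable and the inline sample markup are assumptions made for illustration.

```php
<?php
// Sample markup based on the two tables in the question.
$html = <<<'HTML'
<table class="table table-bordered">
<tbody>
<tr><th> Sender </th><td> Test </td></tr>
<tr><th> Number </th><td> 1234 </td></tr>
</tbody>
</table>
<table class="table">
<tbody>
<tr><th> Table 2 cell 1 </th><td> Test table 2 </td></tr>
<tr><th> Number something </th><td> 1234 table 2 </td></tr>
</tbody>
</table>
HTML;

// Hypothetical helper: reads one th => td pair per row from the table
// whose class attribute is exactly $tableClass.
function readKeyValueTable(string $html, string $tableClass): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // $tableClass must not contain a single quote in this sketch.
    $rows = $xpath->query(sprintf("//table[@class='%s']//tr", $tableClass));
    $cells = [];
    foreach ($rows as $row) {
        // Look up the header and value inside this row only.
        $th = $xpath->query('./th', $row)->item(0);
        $td = $xpath->query('./td', $row)->item(0);
        if ($th === null || $td === null) {
            continue; // skip empty or malformed rows
        }
        $cells[] = [trim($th->textContent) => trim($td->textContent)];
    }
    return $cells;
}

$second = readKeyValueTable($html, 'table');
echo json_encode($second), "\n";
// [{"Table 2 cell 1":"Test table 2"},{"Number something":"1234 table 2"}]
```

Passing 'table table-bordered' as the second argument reads the first table instead.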

Extract table data from HTML page in php

I have an HTML table with multiple rows, each row with multiple columns. A sample with one row looks like this:
<table class ="classt">
<tbody>
<tr class="row">
<td height="20" valign="top" class="mosttext-new">data</td>
<td height="20" valign="top" class="mosttext-new"> data</td>
<td height="20" valign="top" class="mosttext-new">data</td>
</tr>
</tbody>
</table>
I am trying to extract all td elements like this in a php script.
foreach($html->find('table.classt') as $e){
foreach ($e->find('tr.row') as $tr){
foreach ($tr->find('td') as $td){
$text = $td->innertext;
}
}
}
But $tr does not give me the row with its td tags. I just get the text of the entire row within double quotes, like this:
"data data data"
so my third loop cannot find any td, because $tr does not contain td tags.
Any idea on this?
I think you have to add the class name after 'td', separated by '.', like this:
foreach ($tr->find('td.mosttext-new') as $td)
Hopefully this solves your problem. All the best.
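If the nested find() calls keep returning flattened text, one way to cross-check is to walk the same table with PHP's built-in DOM extension. The following is only a sketch under that assumption: it does not use Simple HTML DOM, and the function name tableRows and the inline sample markup are illustrative.

```php
<?php
// Sample markup based on the row shown in the question.
$html = <<<'HTML'
<table class="classt">
<tbody>
<tr class="row">
<td height="20" valign="top" class="mosttext-new">data</td>
<td height="20" valign="top" class="mosttext-new"> data</td>
<td height="20" valign="top" class="mosttext-new">data</td>
</tr>
</tbody>
</table>
HTML;

// Hypothetical helper: returns one array of cell texts per matching row.
function tableRows(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $rows = [];
    foreach ($xpath->query("//table[@class='classt']//tr[@class='row']") as $tr) {
        $row = [];
        // './td' is evaluated relative to the current row.
        foreach ($xpath->query('./td', $tr) as $td) {
            $row[] = trim($td->textContent);
        }
        $rows[] = $row;
    }
    return $rows;
}

$rows = tableRows($html);
print_r($rows);
```

Collecting each cell into an array also avoids overwriting $text on every iteration, which the loop in the question does.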

How to use DOMDocument to get child elements?

I am trying to get the text of child elements using the PHP DOM.
Specifically, I am trying to get only the first <a> tag within every <tr>.
The HTML is like this...
<table>
<tbody>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
</tbody>
</table>
My sad attempt at it involved using foreach() loops, but would only return Array() when doing a print_r() on the $aVal.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(returnURLData($url));
libxml_use_internal_errors(false);
$tables = $dom->getElementsByTagName('table');
$aVal = array();
foreach ($tables as $table) {
foreach ($table as $tr){
$trVal = $tr->getElementsByTagName('tr');
foreach ($trVal as $td){
$tdVal = $td->getElementsByTagName('td');
foreach($tdVal as $a){
$aVal[] = $a->getElementsByTagName('a')->nodeValue;
}
}
}
}
Am I on the right track or am I completely off?
Put this code in test.php:
require 'simple_html_dom.php';
$html = file_get_html('test1.php');
foreach ($html->find('table tr') as $row) {
    // find('a', 0) returns the first <a> in the row (or null), not a list
    $link = $row->find('a', 0);
    if ($link) {
        echo $link->plaintext;
    }
}
and put your html code in test1.php
<table>
<tbody>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
<tr>
<td>
1st Link
</td>
<td>
2nd Link
</td>
<td>
3rd Link
</td>
</tr>
</tbody>
</table>
I am pretty sure I am late, but a better way would be to iterate through all "tr" elements with getElementsByTagName and, for each node in the returned node list, call getElementsByTagName("a"). There is no need to iterate through that second node list; just pick the first element with item(0). That's it! Another way is to use XPath.
I personally don't like SimpleHtmlDom because of the many extra features it loads when only a small piece of functionality is required. With heavy scraping, memory management issues can also hold you back; it's better to do the DOM analysis yourself than to depend on a third-party library.
Just my opinion. I used SHD initially too, but realized this later.
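The getElementsByTagName approach described above can be sketched as follows. The function name firstLinkPerRow is hypothetical, and the sample markup assumes each cell wraps its text in an <a> tag with a placeholder href, since the question asks for the first <a> in every <tr>.

```php
<?php
// Assumed sample markup: each cell contains a link (hrefs are placeholders).
$html = <<<'HTML'
<table>
<tbody>
<tr>
<td><a href="#r1c1">1st Link</a></td>
<td><a href="#r1c2">2nd Link</a></td>
<td><a href="#r1c3">3rd Link</a></td>
</tr>
<tr>
<td><a href="#r2c1">1st Link</a></td>
<td><a href="#r2c2">2nd Link</a></td>
<td><a href="#r2c3">3rd Link</a></td>
</tr>
</tbody>
</table>
HTML;

// Hypothetical helper: returns the text of the first <a> in every <tr>.
function firstLinkPerRow(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $texts = [];
    foreach ($dom->getElementsByTagName('tr') as $tr) {
        // item(0) is the first <a> descendant of this row, or null.
        $first = $tr->getElementsByTagName('a')->item(0);
        if ($first !== null) {
            $texts[] = trim($first->textContent);
        }
    }
    return $texts;
}

$links = firstLinkPerRow($html);
print_r($links);
```

Note that getElementsByTagName returns a node list, so nodeValue has to be read from a single node such as item(0), not from the list itself.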
You're not setting $trVal and $tdVal, yet you're looping over them?

DOM Document PHP Replace TD

I would like to be pointed in the right direction on how I would go about editing data (not headings) of a table using PHP DOM Document.
I have been looking into PHP DomDocument to replace the content of "Name 1" and "Age 1" etc, with real data from a database, however I am having a few issues...
<?php
$doc = new DOMDocument();
$doc->loadHTMLFile('template.html');
$sql = 'SELECT name,
age
FROM db.people';
$sql = mysql_query($sql);
for($i=0; $person = mysql_fetch_assoc($sql); $i++)
{
$doc->getElementsByTagName('td')->item($i)->nodeValue = $person['name'];
}
$doc->formatOutput = TRUE;
echo $doc->saveHTML();
?>
I would like to continue editing the above PHP code to replace place holder data with data from a database.
<table>
<tr>
<th>Name</th>
<th>Age</th>
</tr>
<tr>
<td>Stephanie</td>
<td>22</td>
</tr>
<tr>
<td>Martin</td>
<td>45</td>
</tr>
<tr>
<td>Sarah</td>
<td>61</td>
</tr>
<tr>
<td>Kevin</td>
<td>12</td>
</tr>
</table>
Can anyone point me in the right direction and tell me if I'm on the right track?
Both assignments inside the loop are assigning to exactly the same element, except the top one assigns "name" to the element, and the bottom one assigns "age" to the element. So Age always wins.
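One way to avoid writing both values into the same element is to compute a separate cell index for each field. The sketch below assumes every data row has exactly two td cells (name, then age), so row $i owns cells 2*$i and 2*$i + 1. The function name fillTable, the inline template, and the $people array standing in for the database result are all illustrative assumptions.

```php
<?php
// Assumed template with placeholder cells, as described in the question.
$template = <<<'HTML'
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>Name 1</td><td>Age 1</td></tr>
<tr><td>Name 2</td><td>Age 2</td></tr>
</table>
HTML;

// Stand-in for rows fetched from the database.
$people = [
    ['name' => 'Stephanie', 'age' => 22],
    ['name' => 'Martin', 'age' => 45],
];

// Hypothetical helper: writes each person's name and age into its own cell.
function fillTable(string $template, array $people): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($template);
    libxml_clear_errors();

    // th cells are not matched here, so the headings stay untouched.
    $cells = $doc->getElementsByTagName('td');
    foreach (array_values($people) as $i => $person) {
        $nameCell = $cells->item(2 * $i);     // first cell of row $i
        $ageCell  = $cells->item(2 * $i + 1); // second cell of row $i
        if ($nameCell === null || $ageCell === null) {
            break; // more people than template rows
        }
        $nameCell->nodeValue = $person['name'];
        $ageCell->nodeValue = (string) $person['age'];
    }
    return $doc->saveHTML();
}

$output = fillTable($template, $people);
echo $output;
```

If the template can have fewer rows than the query returns, rows would have to be created rather than filled, which this sketch does not attempt.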

PHP DOM grabbing a specific subset of information

The webpage in question is http://assignments.uspto.gov/assignments/q?db=pat&pub=20060030630
Now, let's just say I want to capture the Assignees in the first assignment. The relevant code there looks like
<div class="t3">Assignee:</div>
</td>
</tr>
</table>
</td><td>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<tbody valign="top">
<tr>
<td>
<table>
<tr>
<td>
<div class="p1">
LEAR CORPORATION
</div>
</td>
</tr>
<tr>
<td><span class="p1">21557 TELEGRAPH ROAD</span></td>
</tr>
<tr>
<td><span class="p1">SOUTHFIELD, MICHIGAN 48034</span></td>
</tr>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
I suppose I could use XPath and grab everything out of the spans with class p1, except that this class is used throughout the page for basically everything; the same goes for the div class that LEAR CORPORATION is in.
So is there a way for me to just read "Assignees" and then grab just the information relevant to it?
I figure if I can understand how to do that, then I can extrapolate from that and figure out how to grab any specific data on the page that I want, i.e. grabbing the conveyance data on any particular assignment.
But if say, I were just to grab all the data on the page (reel/frame, conveyance, assignors, assignee, correspondent for every assignment, and the header information about the patent itself), might that be easier to do than trying to grab each individual piece of information?
There is no clear way to do it, since nothing in the DOM designates where this information is. It's very arbitrary.
I would recommend using some math to figure out the pattern of where in the DOM the Assignee resides.
For example, among the elements with class p1, the assignee value sits at position 16, and a new assignment starts every 23 positions. Using a loop, you could pick it out.
This should get you started at the very least.
$Site = file_get_contents('http://assignments.uspto.gov/assignments/q?db=pat&pub=20060030630');
$Dom = new DomDocument();
$Dom->loadHTML($Site);
$Finder = new DomXPath($Dom);
// Match every element whose class list contains p1
$Nodes = $Finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' p1 ')]");
$position = 0;
foreach ($Nodes as $node) {
    if (($position % 16) == 0 && $position > 0) {
        var_dump($node->nodeValue);
        break;
    }
    $position++;
}
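As an alternative to counting positions, the label itself can serve as the anchor. This is a hedged sketch that holds only if the page keeps the structure shown in the question's fragment: a div with class t3 holding the label, followed in document order by a td containing the p1 values. The function name valuesForLabel and the inline sample markup (including the second "Correspondent:" row and its value) are assumptions made for illustration, not taken from the live page.

```php
<?php
// Assumed sample markup modelled on the fragment in the question.
$html = <<<'HTML'
<table>
<tr>
<td><table><tr><td><div class="t3">Assignee:</div></td></tr></table></td>
<td>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<tbody valign="top"><tr><td><table>
<tr><td><div class="p1">LEAR CORPORATION</div></td></tr>
<tr><td><span class="p1">21557 TELEGRAPH ROAD</span></td></tr>
<tr><td><span class="p1">SOUTHFIELD, MICHIGAN 48034</span></td></tr>
</table></td></tr></tbody>
</table>
</td>
</tr>
<tr>
<td><table><tr><td><div class="t3">Correspondent:</div></td></tr></table></td>
<td><span class="p1">HYPOTHETICAL CORRESPONDENT</span></td>
</tr>
</table>
HTML;

// Hypothetical helper: returns the p1 texts that follow the first
// occurrence of the given label.
function valuesForLabel(string $html, string $label): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // (…)[1] keeps only the first matching label; following::td[1] is the
    // first td that starts after that label in document order.
    // $label must not contain a single quote in this sketch.
    $query = sprintf(
        "(//div[@class='t3'][normalize-space(.)='%s'])[1]"
        . "/following::td[1]//*[@class='p1']",
        $label
    );
    $values = [];
    foreach ($xpath->query($query) as $node) {
        $values[] = trim($node->textContent);
    }
    return $values;
}

$assignee = valuesForLabel($html, 'Assignee:');
print_r($assignee);
```

The same call with a different label would pick out another field, provided that field follows the same label-then-values layout.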
