PHP: how to assign scraped HTML to an array

I want to format what is output by the following php script:
<?php
$stop = $_POST["stop_number"]; // stop_number is a text input value provided by the user
$depart_url = "http://64.28.34.43/hiwire?.a=iNextBusResults&StopId=" . $stop;
$html = file_get_contents($depart_url);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$my_xpath_query = "//td[@valign='top']";
$result = $xpath->query($my_xpath_query);
foreach($result as $result_object)
{
echo $result_object->childNodes->item(0)->nodeValue,'<br>';
}
?>
Here is the output (at least in one instance, as the data changes over time).
18 - GOLD
OUTBOUND
8:17p
8:16p
8 - GREEN
OUTBOUND
8:46p
8:46p
8 - GREEN
OUTBOUND
18 - GOLD
OUTBOUND
5 - PLUM
OUTBOUND
EDIT:
I want the output above to go into a table like the one below. Instead of the hard-coded text between the tags, the cells would hold variables or items from the PHP script output.
<!DOCTYPE html>
<html>
<title>Departure Table</title>
<body>
<h4>Next Departures for Stop Number: __ </h4>
<table border="1px solid black">
<tr>
<th>Route</th>
<th>Direction</th>
<th>Scheduled</th>
<th>Estimated</th>
</tr>
<tr>
<td>18 - Gold</td>
<td>Outbound</td>
<td>8:17p</td>
<td>8:16p</td>
</tr>
<tr>
<td>8 - Green</td>
<td>Outbound</td>
<td>8:46p</td>
<td>8:46p</td>
</tr>
</table>
</body>
</html>

Try appending a newline character (\n) in your echo statement:
echo $result_object->childNodes->item(0)->nodeValue."\n";
EDIT:
If you want to store your data in PHP variables, you could do something like this:
Store the data in an array (or any other data structure that suits your needs) and then iterate over it.
$store_data_in_array_variable = array();
foreach($result as $result_object)
{
$store_data_in_array_variable[] = $result_object->childNodes->item(0)->nodeValue;
}
//iterate over all stored values
foreach ($store_data_in_array_variable as $key => $value)
{
echo $key;
echo '<br>';
echo $value;
}
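To get from the flat array to the table in the question, one option is to group the values into rows first. This is only a minimal sketch, assuming each departure starts with a route label of the form "18 - GOLD" and that missing times should become empty cells. groupDepartures() and renderTable() are hypothetical helper names, and the sample values are copied from the output in the question:

```php
<?php
// Hypothetical helper: starts a new row whenever a value looks like a route
// label ("18 - GOLD"), then pads or trims every row to four columns.
function groupDepartures(array $values): array
{
    $rows = [];
    foreach ($values as $value) {
        $value = trim($value);
        if ($value === '') {
            continue; // ignore empty cells
        }
        if ($rows === [] || preg_match('/^\d+\s*-\s*\S/', $value)) {
            $rows[] = []; // a route label opens a new row
        }
        $rows[count($rows) - 1][] = $value;
    }
    return array_map(function (array $row) {
        return array_pad(array_slice($row, 0, 4), 4, '');
    }, $rows);
}

// Hypothetical helper: renders the grouped rows as an HTML table.
function renderTable(array $rows): string
{
    $html = "<table border=\"1\">\n"
          . "<tr><th>Route</th><th>Direction</th><th>Scheduled</th><th>Estimated</th></tr>\n";
    foreach ($rows as $row) {
        $html .= '<tr>';
        foreach ($row as $cell) {
            $html .= '<td>' . htmlspecialchars($cell) . '</td>';
        }
        $html .= "</tr>\n";
    }
    return $html . "</table>\n";
}

// Sample values taken from the output shown in the question.
$sample = ['18 - GOLD', 'OUTBOUND', '8:17p', '8:16p',
           '8 - GREEN', 'OUTBOUND', '8:46p', '8:46p',
           '8 - GREEN', 'OUTBOUND'];

$rows = groupDepartures($sample);
echo renderTable($rows);
```

In the real script you would pass $store_data_in_array_variable instead of $sample. If the live page uses a different route format, the regular expression would need adjusting.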


Getting DOM elements of html from file_get_contents [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 6 years ago.
I am fetching html from a website with file_get_contents. I have a table (with a class name) inside html, and I want to get the data inside html tags.
This is how I fetch the html data from url:
$url = 'http://example.com';
$content = file_get_contents($url);
The html looks like:
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
</tbody>
</table>
Is there a way to search DOM elements in PHP like we do in jQuery, so that I can access the values 1 and 2 (first td) and the div's value inside the second td?
Something like
a) search the html for table with class name space
b) inside that table, inside tbody, return each tr's 'first td's value' and 'div's value inside second td'
So I get; 1 and Mars, 2 and Earth.
Use the DOM extension, for example. Its DOMXPath class is particularly useful for this kind of task.
You can easily set the listed conditions with an XPath expression like this:
//table[@class="space"]//tr[count(td) = 2]/td
where
- //table[@class="space"] selects all table elements in the document whose class attribute value is exactly the string "space";
- //tr[count(td) = 2] selects all tr elements having exactly two td child elements;
- /td represents the td elements.
Sample implementation:
$html = <<<'HTML'
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
<tr>
<td class="earthia">3</td>
</tr>
</tbody>
</table>
HTML;
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//table[@class="space"]//tr[count(td) = 2]/td');
$i = 0;
foreach ($cells as $td) {
if (++$i % 2) {
$number = $td->nodeValue;
} else {
$planet = trim($td->textContent);
printf("%d: %s\n", $number, $planet);
}
}
Output
1: Mars
2: Earth
Treat the code above as a sample rather than something for practical use, as it is not very scalable. The logic depends on the XPath expression selecting exactly two cells for each row. In practice, you may want to select the rows, iterate over them, and put the extra conditions into the loop, e.g.:
$rows = $xpath->query('//table[@class="space"]//tr');
foreach ($rows as $tr) {
$cells = $xpath->query('.//td', $tr);
if ($cells->length < 2) {
continue;
}
$number = $cells[0]->nodeValue;
$planet = trim($cells[1]->textContent);
printf("%d: %s\n", $number, $planet);
}
DOMXPath::query() is called with an XPath expression relative to the current row ($tr); the loop then skips any row whose DOMNodeList holds fewer than two cells. The rest of the code is straightforward.
You can also use the SimpleXML extension, which supports XPath as well, but it is much less flexible than the DOM extension.
For huge documents, use a streaming parser such as XMLReader.
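If you would rather collect the values than print them, the row-based approach can fill an array. The sketch below makes two assumptions that are not in the question: the table may carry more than one class (so the class test uses the usual concat/normalize-space idiom instead of an exact match), and the result should be keyed by the number in the first cell. The sample markup is made up for illustration:

```php
<?php
// Made-up sample markup; note the table has two classes.
$html = '<table class="space wide"><tbody>
<tr><td class="marsia">1</td><td class="mars"><div>Mars</div></td></tr>
<tr><td class="earthia">2</td><td class="earth"><div>Earth</div></td></tr>
<tr><td class="earthia">3</td></tr>
</tbody></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match "space" as one of several whitespace-separated class names,
// and keep only rows that have at least two cells.
$query = '//table[contains(concat(" ", normalize-space(@class), " "), " space ")]'
       . '//tr[count(td) >= 2]';

$planets = [];
foreach ($xpath->query($query) as $tr) {
    // evaluate() with string() returns plain strings relative to the row
    $number = (int) trim($xpath->evaluate('string(td[1])', $tr));
    $planet = trim($xpath->evaluate('string(td[2]/div)', $tr));
    $planets[$number] = $planet;
}

print_r($planets);
```

The third row is skipped because it has only one cell.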

How to extract data from an HTML table using PHP

I keep trying different methods of extracting the data from the HTML table, such as XPath. The tables do not have any classes or IDs, so I am not sure how to target them with XPath. The data is retrieved from an RSS XML file, and I am currently using DOM. After I extract the data, I will try to sort the tables by Job Title.
Here is my php code
$html='';
$xml= simplexml_load_file($url) or die("ERROR: Cannot connect to url\n check if report still exist in the Gradleaders system");
/*What we do here in this loop is retrieve all content inside the encoded content,
*which includes the CDATA information. This is where the HTML and styling is included.
*/
foreach($xml->channel->item as $cont){
$html=''.$cont->children('content',true)->encoded.'<br>'; //actual tag name is encoded
}
$htmlParser= new DOMDocument(); //to parse html using DOMDocument
libxml_use_internal_errors(true); // your HTML gives parser warnings, keep them internal
$htmlParser->loadHTML($html); //Loaded the html string we took from simple xml
$htmlParser->preserveWhiteSpace = false;
$tables= $htmlParser->getElementsByTagName('table');
$rows= $tables->item(0)->getElementsByTagName('tr');
foreach($rows as $row){
$cols = $row->getElementsByTagName('td');
echo $cols;
}
This is the HTML I am extracting info from
<table cellpadding='1' cellspacing='2'>
<tr>
<td><b>Job Title:</b></td>
<td>Job Example </td>
</tr>
<tr>
<td><b>Job ID:</b></td>
<td>23992</td>
</tr>
<tr>
<td><b>Job Description:</b></td>
<td>Just a job example </td>
</tr>
<tr>
<td><b>Job Category:</b></td>
<td>Work-study Position</td>
</tr>
<tr>
<td><b>Position Type:</b></td>
<td>Work-study</td>
</tr>
<tr>
<td><b>Applicant Type:</b></td>
<td>Work-study</td>
</tr>
<tr>
<td><b>Status:</b></td>
<td>Active</td>
</tr>
<tr>
<td colspan='2'><b><a href='https://www.myjobs.com/tuemp/job_view.aspx?token=I1iBwstbTs2pau+SjrYfWA%3d%3d'>Click to View More</a></b></td>
</tr>
</table>
You can use xpath to query('//td') and retrieve the td html using C14N(), something like:
$dom = new DOMDocument();
$dom->loadHtml($html);
$x = new DOMXpath($dom);
foreach($x->query('//td') as $td){
echo $td->C14N();
//if just need the text use:
//echo $td->textContent;
}
Output:
<td><b>Job Title:</b></td>
<td>Job Example </td>
<td><b>Job ID:</b></td>
...
C14N();
Returns canonicalized nodes as a string or FALSE on failure
Update:
Another question, how can I grab individual Table Data? For example,
just grab, Job ID
Use the XPath contains() function, e.g.:
foreach($x->query('//td[contains(., "Job ID:")]') as $td){
echo $td->textContent;
}
Update V2:
How can I get the next Table Data after that (to actually get the Job
Id) ?
Use following-sibling::*[1], e.g.:
echo $x->query('//td[contains(*, "Job ID:")]/following-sibling::*[1]')->item(0)->textContent;
//23992
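$xpathParser = new DOMXPath($htmlParser);
$tableDataNodes = $xpathParser->evaluate("//table/tr/td");
// DOMNodeList exposes its size as ->length and its nodes through ->item()
for ($x = 0; $x < $tableDataNodes->length; $x++) {
echo $tableDataNodes->item($x)->textContent;
}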
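Since the goal is to sort the tables by Job Title, it may help to turn each label/value table into an associative array first. This is a rough sketch, not a drop-in solution: tableToRecord() is a hypothetical helper, the two sample tables are invented, and it assumes every useful row has exactly two td cells with the label in the first one:

```php
<?php
// Hypothetical helper: converts one label/value table into an array such as
// ['Job Title' => 'Job Example', 'Job ID' => '23992', ...].
function tableToRecord(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings internal
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $record = [];
    // Rows with a single cell (e.g. the colspan link row) are ignored.
    foreach ($xpath->query('//tr[count(td) = 2]') as $tr) {
        $label = rtrim(trim($xpath->evaluate('string(td[1])', $tr)), ':');
        $record[$label] = trim($xpath->evaluate('string(td[2])', $tr));
    }
    return $record;
}

// Invented sample data: one HTML table per feed item.
$tables = [
    "<table><tr><td><b>Job Title:</b></td><td>Tutor </td></tr>"
    . "<tr><td><b>Job ID:</b></td><td>23992</td></tr>"
    . "<tr><td colspan='2'><b>Click to View More</b></td></tr></table>",
    "<table><tr><td><b>Job Title:</b></td><td>Assistant </td></tr>"
    . "<tr><td><b>Job ID:</b></td><td>100</td></tr></table>",
];

$jobs = array_map('tableToRecord', $tables);

// Sort the records alphabetically by job title.
usort($jobs, function (array $a, array $b) {
    return strcmp($a['Job Title'], $b['Job Title']);
});

print_r($jobs);
```

With the feed from the question, $tables would be built by collecting the content:encoded string of every item instead of overwriting $html on each pass.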

Getting variables from SQL Server for mPDF

I'm using the mPDF class to output a pdf of data from a PHP file. I need to loop through a SQL Server query, save as new variables and write into the $html so it can be outputted to the pdf. I can't place it in the WriteHTML function because it does not recognize PHP code. I need the contents of the whole array so I can't just print one variable.
I have two files:
pdf-test.php:
This file gathers session variables from other php files that are included and reassigns them, so I can use them in the $html.
<?php
// Include files
require_once("form.php");
require_once("configuration.php");
session_start();
$html = '
<h3> Form A </h3>
<div>
<table>
<thead>
<tr>
<th colspan="3">1. Contact Information</th>
</tr>
</thead>
<tr>
<td> First Name: </td>
<td> Last Name: </td>
</tr>
<tr>
<td>'.$firstName.'</td>
<td>'.$lastName.'</td>
</tr>
.
.
.
</table>
';
echo $html;
pdf-make.php:
This file holds the code to actually convert the contents of pdf-test.php into a pdf.
<?php
// Direct to the mpdf file.
include('mpdf/mpdf.php');
// Collect all the content.
ob_start();
include "pdf-test.php";
$template = ob_get_contents();
ob_end_clean();
$mpdf=new mPDF();
$mpdf->WriteHTML($template);
// I: send the file inline to the browser.
$mpdf->Output('cust-form-a', 'I');
?>
This is my loop:
$tbl = "form_Customers";
$sql = "SELECT ROW_NUMBER() OVER(ORDER BY custFirt ASC)
AS RowNumber,
formID,
custFirt,
custLast,
displayRecord
FROM $tbl
WHERE formID = ? and displayRecord = ?";
$param = array($_SESSION["formid"], 'Y');
$stmt = sqlsrv_query($m_conn, $sql, $param);
$row = sqlsrv_fetch_array($stmt);
while ($row = sqlsrv_fetch_array($stmt)) {
$rowNum = $row['RowNumber'];
$firstN = $row['custFirt'];
$lastN = $row['custLast'];
}
When I try to include $rowNum, $firstN or $lastN in the $html such as
<td> '.$rowNum.'</td>
, it just shows up blank.
I'm not sure where the loop should go (which file) or how to include the $rowNum, $firstN and $lastN variables in the $html like the others.
I'm new to PHP (and relatively new to coding in general) and I don't have much experience working with it, but I've been able to make mPDF work for me in similar instances without the query included.
Any help would be greatly appreciated. Thank you so much!
I'm not sure how your loop interacts with the other two files, but this looks overly complex to me. I'd approach this in one .php file, something sort of like this:
<?php
//Include Files
include('mpdf/mpdf.php');
... //Your additional includes
//Define a row template string
$rowtemplate =<<<EOS
<tr>
<td>%%RowNumber%%</td>
<td>%%custFirt%%</td>
<td>%%custLast%%</td>
</tr>
EOS;
//Initialize the HTML for the document.
$html =<<<EOS
<h3> Form A </h3>
... //Your code
<td> Last Name: </td>
</tr>
EOS;
//Loop Code
$tbl = "form_Customers";
... //Your code
//Fetch each row once, as an associative array (an extra fetch before the loop would skip the first row)
while ($row = sqlsrv_fetch_array($stmt, SQLSRV_FETCH_ASSOC)) {
//Copy rowtemplate to a temporary variable
$out_tmp = $rowtemplate;
//Loop through your SQL variables and replace them when they appear in the template
foreach ($row as $key => $val) {
$out_tmp = str_ireplace('%%'.$key.'%%', $val, $out_tmp);
}
//Append the result to $html
$html .= $out_tmp;
}
// Close the open tags in $html
$html .= "</table></div>";
//Write the PDF
$mpdf=new mPDF();
$mpdf->WriteHTML($html);
$mpdf->Output('cust-form-a', 'I');
I'm using heredoc syntax for the strings, since I think this is the cleanest way to include a large string.
Also, I prefer to omit the closing ?> tag, since any stray whitespace after it is sent as output and becomes a needless source of errors.
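The template substitution can be tried without a database or mPDF. The sketch below swaps in strtr() for the str_ireplace() loop (so placeholders are matched case-sensitively) and escapes each value. renderRows() is a hypothetical helper and the two rows are invented stand-ins for what sqlsrv_fetch_array() would return:

```php
<?php
// Row template with %%column%% placeholders, as in the answer above.
$rowTemplate = "<tr><td>%%RowNumber%%</td><td>%%custFirt%%</td><td>%%custLast%%</td></tr>\n";

// Invented sample rows standing in for the SQL Server result set.
$rows = [
    ['RowNumber' => 1, 'custFirt' => 'Ada',   'custLast' => 'Lovelace'],
    ['RowNumber' => 2, 'custFirt' => 'Grace', 'custLast' => 'Hopper'],
];

// Hypothetical helper: fills the template once per row and joins the results.
function renderRows(string $template, array $rows): string
{
    $out = '';
    foreach ($rows as $row) {
        $pairs = [];
        foreach ($row as $key => $val) {
            // Build a placeholder => escaped value map for strtr()
            $pairs['%%' . $key . '%%'] = htmlspecialchars((string) $val);
        }
        $out .= strtr($template, $pairs);
    }
    return $out;
}

$rowsHtml = renderRows($rowTemplate, $rows);
echo $rowsHtml;
```

The resulting string would be appended to $html before the closing table tag and then passed to WriteHTML() as in the answer.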

When creating multiple HTML tables in PHP, can one set the width to a variable?

I'm trying to create an admin page for a small web site. My plan is to create a table for each user with the relevant data put into individual cells.
Here's my code right now:
foreach ($user_data as $key => $element)
{
echo <<<EOT
<div><table border="1" style="width:100%">
EOT;
foreach ($element as $subkey => $sub_element)
{
echo <<<EOT
<td>$sub_element</td>
EOT;
}
echo <<<EOT
</table></div>
EOT;
}
The problem is the cells in each table are of different length so the data does not line up nice and neat under the column headings (not shown here). I'm wondering if there is a way (using CSS?) to have each cell be a different, but specific, length using a variable for the width. I'm thinking I could just use a counter to keep track of which cell is being created and use a different width for each number in the counter (i.e. an array of 5 different lengths that are looped through along with the data).
Am I even approaching this the right way?
Following is the solution I came up with. Is this best practice? Is there a better way?
<?php
$cell_width = array('20%','10%','10%','10%', '5%', '10%','25%','10%');
foreach ($user_data as $key => $element)
{
$counter = 0;
echo <<<EOT
<br /><div><table style="width:100%">
EOT;
foreach ($element as $subkey => $sub_element)
{
echo <<<EOT
<td width=$cell_width[$counter]>$sub_element</td>
EOT;
$counter++;
}
echo <<<EOT
</table></div>
EOT;
}
?>
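One possible alternative, offered as a sketch rather than as best practice: put all users in a single table and declare the widths once with col elements, so every row shares the same column layout. renderUserTable() is a hypothetical helper, and the sample data and widths are invented:

```php
<?php
// Hypothetical helper: one table for all users, widths declared once
// in a <colgroup> instead of on every cell.
function renderUserTable(array $userData, array $widths): string
{
    $html = "<table style=\"width:100%; table-layout:fixed\">\n<colgroup>\n";
    foreach ($widths as $width) {
        $html .= '<col style="width:' . htmlspecialchars($width) . "\">\n";
    }
    $html .= "</colgroup>\n";

    foreach ($userData as $user) {
        $html .= '<tr>';
        foreach ($user as $value) {
            $html .= '<td>' . htmlspecialchars((string) $value) . '</td>';
        }
        $html .= "</tr>\n";
    }
    return $html . "</table>\n";
}

// Invented sample data: two users with two fields each.
$user_data = [
    ['name' => 'alice', 'email' => 'alice@example.com'],
    ['name' => 'bob',   'email' => 'bob@example.com'],
];

$tableHtml = renderUserTable($user_data, ['30%', '70%']);
echo $tableHtml;
```

If one table per user is a requirement, the same colgroup block could be repeated in each table instead.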

PHP code to get the href, title and other info from an HTML table

This is the code I have so far:
<?php
$url=" SOME HTML URL ";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
echo $tag->getAttribute('href');
}
?>
I have HTML pages with tables, and I want the link, the title and the date from each row. Example of the HTML code:
<TR>
<TD align="center" vAlign="top" bgColor="#ffffff" class="smalltext">3</TD>
<TD class="plaintext" >THIS IS THE TITLE </TD>
<TD align="center" class="plaintext" >THIS IS DATE</TD>
</TR>
It works fine for the link, but I don't know how to get the others.
Thanks.
Where you are doing this:
$tags = $doc->getElementsByTagName('a');
You are getting back all the A tags. There only happens to be one.
If you want to get the text "THIS IS DATE", you aren't going to get it by looking in A tags, because the text is not inside an A tag. It is in a TD tag.
$tds = $doc->getElementsByTagName('td');
... would work to get all the TD elements, or you could assign an ID to the element you want to target and use getElementById instead.
Basically, though, this information is all in the documentation, which you absolutely should read before asking questions. Happy reading!
Once again, that's: http://php.net/manual/en/class.domdocument.php
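Putting the pieces together, you could walk the rows and read the cells by position. This is only a sketch under assumptions: the sample markup is invented, it assumes the link sits inside the second cell, and it assumes the date is always in the third cell:

```php
<?php
// Invented sample markup; the position of the <a> element is an assumption.
$html = '<table><TR>
<TD align="center" class="smalltext">3</TD>
<TD class="plaintext"><a href="/doc/3.html">THIS IS THE TITLE</a></TD>
<TD align="center" class="plaintext">THIS IS DATE</TD>
</TR></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$items = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $tds = $tr->getElementsByTagName('td');
    if ($tds->length < 3) {
        continue; // not a data row
    }
    $links = $tds->item(1)->getElementsByTagName('a');
    $items[] = [
        'href'  => $links->length ? $links->item(0)->getAttribute('href') : null,
        'title' => trim($tds->item(1)->textContent),
        'date'  => trim($tds->item(2)->textContent),
    ];
}

print_r($items);
```

The HTML parser lowercases element names, so getElementsByTagName('tr') also finds rows written as TR.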
