How To Format This Scraped Content


I'm grabbing the content of every td with class="job" in this table using this:
$table01 = $salary->find('table.table01');
$rows = $table01[0]->find('td.job');
Then I'm using this to output it. It works, but it only outputs plain text, and I need to do more with it:
foreach($table01[0]->find('td.job') as $element) {
$jobs .= $element->plaintext . '<br />';
}
Ultimately I would like it output in this format. Notice that the a href is built from the job name, with spaces and / replaced by a -.
<tr>
<td class="small"> Graphic Artist / Designer
$23,755 – $55,335 </td>
</tr>
<tr>
<td class="small"> Sales Associate<br />
$15,577 – $56,290 </td>
</tr>
<tr>
<td class="small"> Film / Video Editor<br />
$24,184 – $94,493 </td>
</tr>
Here's the table I'm scraping:
<table cellpadding="0" cellspacing="0" border="0" class="table01">
<tr>
<td class="head">Test</td>
<td class="job">
Graphic Artist / Designer<br/>
$23,755 – $55,335
</td>
</tr>
<tr>
<td class="head">Test</td>
<td class="job">
Sales Associate<br/>
$15,577 – $56,290
</td>
</tr>
<tr>
<td class="head">Test</td>
<td class="job">
Film / Video Editor<br/>
$24,184 – $94,493
</td>
</tr>
</table>

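It may be better to use regular expressions. This assumes each job cell in the scraped page contains a link around the job name:
<?php
$html = file_get_contents('1.html');
$jobs = '';
// Per row, capture the link's href (1), the link text (2) and the rest of the td.job cell (3).
if (preg_match_all("/<tr>.*?<td.*?>.*?<\/td>.*?<td\sclass=\"job\">.*?<a.+?href=\"(.+?)\".+?>(.*?)<\/a>(.*?)<\/td>.*?<\/tr>/ims", $html, $res))
{
    foreach ($res[1] as $i => $uri)
    {
        // Normalise the href: lowercase it and turn "_/_" and "_" into "-".
        $uri = strtolower(urldecode($uri));
        $uri = preg_replace("/_\/_/", '-', $uri);
        $uri = preg_replace("/_/", '-', $uri);
        // Emit the row; linking the job name to $uri is an assumption about the intended markup.
        $jobs .= '<tr><td class="small"> <a href="'.$uri.'">'.$res[2][$i].'</a>'.$res[3][$i].'</td></tr>'."\n";
    }
}
echo $jobs;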
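
The "replace spaces and / with a -" rule from the question can also be tried in isolation. The following is only a minimal sketch, written in Python rather than PHP, and `job_slug` and `format_row` are hypothetical names. Lowercasing the result follows the answer above and is an assumption about what the asker wants, as is using the bare slug as the href.

```python
import re
from html import escape


def job_slug(title: str) -> str:
    """Turn a job title into a lowercase, hyphen-separated link fragment."""
    # Collapse every run of whitespace and slashes into a single hyphen.
    slug = re.sub(r"[\s/]+", "-", title.strip())
    return slug.lower()


def format_row(title: str, salary: str) -> str:
    """Build one output row; the href layout is assumed, not taken from the question."""
    return (
        '<tr><td class="small"> '
        f'<a href="{job_slug(title)}">{escape(title)}</a><br />'
        f"{escape(salary)} </td></tr>"
    )


if __name__ == "__main__":
    print(format_row("Graphic Artist / Designer", "$23,755 – $55,335"))
```

Collapsing a whole run such as " / " into one hyphen avoids links like `graphic-artist---designer`.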

Related

Prevent table resizing in PDF

I'm using mPDF to generate a PDF from a form. The form has an option to add new rows to a table. The problem occurs when there are too many rows to fit on one PDF page: the table is scaled down (made smaller) instead of continuing on the next page.
This is mpdf code:
$mpdf=new mPDF('UTF-8','A4','','',20,15,48,25,10,10);
$mpdf->WriteHTML(generatePDF());
$mpdf->Output();
exit;
This is html table code:
function getHTMLStyle(){
$html ='<table class="items" width="100%" style="font-size: 9pt; border-collapse: collapse;" cellpadding="8">
<tr>
<td width="5%">A</td>
<td width="95%"><b>'.$a.'</b><br /><br /> '.$_POST['title'].'</td>
</tr>
<tr>
<td >B</td>
<td ><b>'.$b.'</b><br /><br /> '.$_POST['organizationName'].'</td>
</tr>
<tr>
<td >C</td>
<td></td>
<table class="items2" width="100%" page-break-before="always" >
<tr>
<td ><b>'.$c.'</b></td>'.addTableC().'
</tr>
</table>
</tr>
This is an image with the proper view:
And this is an image with the wrong view:
How can I make the table break and continue on the next page?

Because you're nesting the tables incorrectly: the inner table sits directly inside the <tr> instead of inside a <td>.
<tr>
<td >C</td>
<td></td>
<table class="items2" width="100%" page-break-before="always" >
<tr>
<td ><b>'.$c.'</b></td>'.addTableC().'
</tr>
</table>
</tr>
The table should be inside the <td> tag, like so:
<tr>
<td>C</td>
<td>
<table class="items2" width="100%" page-break-before="always" >
<tr>
<td ><b>'.$c.'</b></td>'.addTableC().'
</tr>
</table>
</td>
</tr>

IDE showing error in PHP concatenation

I am displaying a name and address in an email using the following piece of code, which works perfectly:
</tr>
<tr>
<td align="left"><p>'.$admin_message['message'].'</P></td>
</tr>
<tr>
<td align="left"><p>Kind Regards</P></td>
</tr>
<tr>
<td align="left"><p>'.ucwords($finance_info['fi_schoolname']).
'<br />'.ucwords($finance_info['fi_address']).
' <br /> Email :- '.ucwords($finance_info['fi_email']).
'<br /> Contact:- '.ucwords($finance_info['fi_phoneno']).'</P>
</td>
</tr>
Now I want to add a logo image to it, which I am trying to do like this:
<tr>
<td align="left"><p>'.$admin_message['message'].'</P></td>
</tr>
<tr>
<td align="left"><p>Kind Regards</P></td>
</tr>
<tr>
<td align="left"><p>'
if($_SESSION['eschools']['schoollogo']!=""){ echo displayimage("images/school_logo/".$_SESSION['eschools']['schoollogo'], "140"); }
.' '.
ucwords($finance_info['fi_schoolname']).'<br>
'.ucwords($finance_info['fi_address']).' <br />
Email :- '.ucwords($finance_info['fi_email']).'<br>
Contact:- '.ucwords($finance_info['fi_phoneno']).'</P></td>
</tr>
</table>';
But my IDE (Dreamweaver) reports errors on lines 9 and 10. Where is the mistake in my concatenation?
Regards to all

This is wrong, because an if statement cannot be placed in the middle of a string concatenation:
<td align="left"><p>'
if($_SESSION['eschools']['schoollogo']!=""){ echo displayimage("images/school_logo/".$_SESSION['eschools']['schoollogo'], "140"); }
Correct code (end the first echo with a semicolon, run the if, then start a new echo):
<?php
echo '
<tr>
<td align="left"><p>'.$admin_message['message'].'</P></td>
</tr>
<tr>
<td align="left"><p>Kind Regards</P></td>
</tr>
<tr>
<td align="left"><p>';
if($_SESSION['eschools']['schoollogo']!=""){ echo displayimage("images/school_logo/".$_SESSION['eschools']['schoollogo'], "140"); }
echo '    '.
ucwords($finance_info['fi_schoolname']).'<br>
'.ucwords($finance_info['fi_address']).' <br />
Email :- '.ucwords($finance_info['fi_email']).'<br>
Contact:- '.ucwords($finance_info['fi_phoneno']).'</P></td>
</tr>
</table>';

You're missing a semicolon after the closing ' that follows the <p> on line 8.

Extracting table cell text contents with xpath in rows for consumption?

I have something along the following lines in terms of HTML. I would like to extract the various contents of the table cells, however I discovered that there are some embedded divs occasionally in the cells and perhaps other oddities that I'm not sure of yet:
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
<blockquote>
<p class="textstyle">
Text.
</p>
</blockquote>
My first impulse was to extract ALL element texts and just programmatically slice it up. I would watch for Title1, Title2, etc. to know when a row starts and then if a "----" is found meaning no value, just skip this row and move on. However, I realized that there is probably a better way of handling this with xpath directly.
How could this be solved with xpath so as to essentially give each cell's final child text content vs having to walk into each div if it exists? Or is there a more xpath like way to approach this?
Obviously I'm attempting to have the most flexible solution that will not be brittle if other unexpected elements crop up, even though they are unlikely.

The provided text isn't a well-formed XML document, so XPath can't be applied to it directly.
If you correct it and convert it to a well-formed XML document like the one below, an expression like this might be useful:
/*/TABLE//TD//text()
or even:
//TABLE//TD//text()
Here is a well-formed XML document, constructed from the provided HTML:
<html>
<p align="center">
<img src="some_image.gif" alt="Some Title"/>
</p>
<TABLE WIDTH="500" BORDER="1" class="textwhite" ALIGN="center" CELLPADDING="0" CELLSPACING="0">
<TR>
<TD colspan="4" ALIGN="center">
<b>Title</b>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title</TD>
<TD ALIGN="center">date</TD>
<TD ALIGN="center">value</TD>
<TD ALIGN="center">value</TD>
</TR>
<TR>
<TD ALIGN="center">Title2</TD>
<TD ALIGN="center"></TD>
<TD ALIGN="center">
<div class="redtext">----</div>
</TD>
<TD> </TD>
</TR>
<TR>
<TD ALIGN="center">Title3</TD>
<TD ALIGN="center">
<div class="yellowtext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD ALIGN="center">value
<SUP>6</SUP>
</TD>
</TR>
<TR>
<TD ALIGN="center">Title4</TD>
<TD ALIGN="center">
<div class="bluetext">value</div>
</TD>
<TD ALIGN="center">
<div class="redtext">value</div>
</TD>
<TD> </TD>
</TR>
</TABLE>
<blockquote>
<p class="textstyle"> Text. </p>
</blockquote>
</html>
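
The idea of taking all descendant text of each TD can be sketched with Python's standard library. This is only an illustration under assumptions: `table_rows` is a hypothetical helper, the input is a shortened version of the well-formed document above, and because `xml.etree.ElementTree` supports only a subset of XPath (no `text()` step), `itertext()` stands in for `//text()`.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample modelled on the document above.
XML_DOC = """<html>
<TABLE>
<TR><TD ALIGN="center">Title2</TD><TD ALIGN="center"></TD>
<TD ALIGN="center"><div class="redtext">----</div></TD><TD> </TD></TR>
<TR><TD ALIGN="center">Title3</TD><TD ALIGN="center"><div class="yellowtext">value</div></TD>
<TD ALIGN="center"><div class="redtext">value</div></TD><TD ALIGN="center">value<SUP>6</SUP></TD></TR>
</TABLE>
</html>"""


def table_rows(xml_text):
    """Return one list of cell texts per TR, flattening nested elements."""
    root = ET.fromstring(xml_text)
    rows = []
    for tr in root.findall(".//TR"):
        # itertext() walks the cell and everything nested in it (div, SUP, ...).
        rows.append(["".join(td.itertext()).strip() for td in tr.findall("TD")])
    return rows


if __name__ == "__main__":
    for row in table_rows(XML_DOC):
        print(row)
```

Grouping by TR first keeps the row boundaries, so a row whose value cell is "----" can be skipped afterwards without watching for the title strings.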

So maybe you don't want to walk the divs, but here is my solution using lxml, which I highly recommend:
import re
from io import StringIO
from lxml import etree

def getTable(html, table_xpath, rows_xpath, cells_xpath):
    """Get a table on a webpage as a list of rows of cell texts."""
    parser = etree.HTMLParser()
    # Build the document tree and locate the table.
    root = etree.parse(StringIO(html), parser)
    table = root.find(table_xpath)
    if table is None:
        print('No table.')
        return []
    rows = table.findall(rows_xpath)
    document = []

    def cleanText(text):
        """Clean up text by removing line breaks and tabs."""
        return re.sub(r'[\r\n\t]+', '', str(text).strip())

    # Iterate over the table rows and collect the text from each cell.
    for r in rows:
        cells = r.findall(cells_xpath)
        rowdata = []
        for c in cells:
            text = ''
            # itertext() yields the cell's own text and that of nested elements.
            for i in c.itertext():
                text += cleanText(i) + ' '
            rowdata.append(text)
        document.append(rowdata)
    return document

html = """
<html><head><title></title></head><body>
<p align="center">
<img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
<TD ALIGN=center>Title2</TD>
<TD ALIGN=center></TD>
<TD ALIGN=center><div class=redtext>----</div></TD>
<TD> </TD>
</TR><TR>
<TD ALIGN=center>Title3</TD>
<TD ALIGN=center><div class=yellowtext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
<TD ALIGN=center>Title4</TD>
<TD ALIGN=center><div class=bluetext>value</div></TD>
<TD ALIGN=center><div class=redtext>value</div></TD>
<TD> </TD>
</TR></TABLE>
</body>
</html>
"""
tp = ".//table[@width='500']"
rt = "tr"
cp = "td[@align='center']"
doc = getTable(html, tp, rt, cp)
print(repr(doc))

I believe that your program is going to run into many problems as the input data is manipulated -- what if the case of 'title' changes, or there is a typo?
It's not really possible to make a rigorous solution to scraping someone else's website, as they can at no notice completely change everything. Better is normally to write tolerant and flexible code that at least tries to verify that its output is sane. In this case it's probably best to iterate over the results of '//table/tr', then inside this loop, process the td elements:
import lxml.etree
tree = lxml.etree.fromstring("<table><tr><td>test</td></tr><tr><td><div>test2</div></td></tr></table>")
# Join every text node below an element into a single string.
stringify = lambda x: "".join(x.xpath(".//text()"))
for x in tree.xpath("//table/tr"):
    print("New row")
    for y in x.xpath("td"):
        print(stringify(y))
Output:
New row
test
New row
test2
The following code will, however, get the list you ask for:
print(list(map(stringify, tree.xpath("//table/tr/td"))))
Output:
['test', 'test2']
This will find all text nodes descended from a td that is a direct child of a tr, which is in turn a direct child of a table.
(Simply asking for all text() elements will create some funny bugs when run on HTML which contains "<td>Foo <b>bar</b></td>" or similar.)
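
The difference is easy to see on that example cell. This is a small sketch using the standard library's `xml.etree.ElementTree` instead of lxml (an assumption made to keep it dependency-free), with `itertext()` playing the role of `//text()`.

```python
import xml.etree.ElementTree as ET

table = ET.fromstring(
    "<table><tr><td>Foo <b>bar</b></td><td>baz</td></tr></table>"
)

# Asking for every text node at once splits the first cell in two
# and loses the information about which fragment belongs to which cell.
flat = list(table.itertext())

# Joining the text nodes per td keeps one string per cell.
per_cell = ["".join(td.itertext()) for td in table.iter("td")]

if __name__ == "__main__":
    print(flat)      # three fragments for two cells
    print(per_cell)  # one string per cell
```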

PHP MySQL populating values from database

Let's say I retrieve all of the values whose position is top8. I want to display them in a table, but instead of one table containing the different values I get 3 tables, each with a different value. Why is this? How can I get a single table that shows all 3 values?
<?
$facebookID = "top8";
mysql_connect("localhost","root","password") or die(mysql_error());
mysql_select_db("schoutweet") or die(mysql_error());
$data= mysql_query("SELECT schInitial FROM matchTable WHERE position='".$facebookID."'")
or die(mysql_error());
while($row = mysql_fetch_array($data))
{
?>
<center>
<table border="0" cellspacing="0" cellpadding="0" class="tbl_bracket">
<tr>
<td class="brack_under cell_1"><a href="www.facebook.com"/>team 1.1><?= $row['schInitial']?><a/></td>
<td class="cell_2"> </td>
<td class="cell_3"> </td>
<td class="cell_4"> </td>
<td class="cell_5"> </td>
<td class="cell_6"> </td>
</tr>
<tr>
<td class="brack_under_right_up">team 1.2><?= $row['schInitial']?></</td>
<td class="brack_right"><!--1.2.1--></td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td class="brack_right"><!--2.1--></td>
<td class="brack_under"><!--3.1--></td>
<td><!--here?--></td>
<td><!--there?--></td>
<td><!--everywhere?--></td>
</tr>
</table>
</center>
<?
}
?>
</body>

That's because your <table> tag is within the loop! Place the <table> tag outside the while loop.

place your table tags outside the while loop

Because you're writing the table tag inside the while loop. Everything inside the loop is executed on every iteration. If you only want one table in the output, you'll have to open and close the table outside of the loop, like this:
$data= mysql_query("SELECT schInitial FROM matchTable WHERE position='".$facebookID."'")
or die(mysql_error());
?>
<center>
<table border="0" cellspacing="0" cellpadding="0" class="tbl_bracket">
<?
while($row = mysql_fetch_array($data))
{
?>
<tr>
<td class="brack_under cell_1"><a href="www.facebook.com"/>team 1.1><?= $row['schInitial']?><a/></td>
<td class="cell_2"> </td>
<td class="cell_3"> </td>
<td class="cell_4"> </td>
<td class="cell_5"> </td>
<td class="cell_6"> </td>
</tr>
<tr>
<td class="brack_under_right_up">team 1.2><?= $row['schInitial']?></</td>
<td class="brack_right"><!--1.2.1--></td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td class="brack_right"><!--2.1--></td>
<td class="brack_under"><!--3.1--></td>
<td><!--here?--></td>
<td><!--there?--></td>
<td><!--everywhere?--></td>
</tr>
<?
}
?>
</table>
</center>
That will, however, print three rows per loop and therefore per record (but you have references to the table contents in two of them, so I suppose that's what you want?).
Also watch out for the HTML that is not well-formed (e.g. the > character in the expressions team 1.1> / team 1.2>). If you want to print the > character to the browser, encode it as an HTML entity (&gt; in this case). You also have a probably superfluous </ in the first column of the second row (</</td>).
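
Both points, opening the table once outside the loop and escaping the > character, can be sketched in a few lines. This is an illustration in Python rather than PHP, with a hypothetical `render_bracket` function and a plain list standing in for the database rows.

```python
from html import escape


def render_bracket(initials):
    """Build a single table with one row per record."""
    # The table is opened once, before the loop.
    parts = ['<table border="0" cellspacing="0" cellpadding="0" class="tbl_bracket">']
    for number, name in enumerate(initials, start=1):
        # escape() turns the literal ">" into "&gt;" so the markup stays valid.
        label = escape(f"team 1.{number}> {name}")
        parts.append(f'<tr><td class="brack_under cell_1">{label}</td></tr>')
    # The table is closed once, after the loop.
    parts.append("</table>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(render_bracket(["ABC", "DEF", "GHI"]))
```

Only the row markup sits inside the loop, so three records give three rows in one table rather than three tables.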

You need to echo the HTML part as well in the while loop, like:
echo '<table>';

Can php code execute while creation of pdf using tcpdf?

I'm working on a module that has to generate a PDF from a PHP page. I'm using TCPDF for that, but I'm facing one problem: the file contains some MySQL queries and PHP code, which are not executed in the generated PDF.
$prn_no = $_POST['prn_no'];
$current_sem = $_POST['current_sem'];
$qr_fetch_sem_res_id = mysql_query("SELECT * FROM table1 WHERE ((prn='$prn_no') AND (semisterName='$current_sem'))")or die(mysql_error());
$qr_fetch_sem_result_ans = mysql_fetch_array($qr_fetch_sem_res_id);
<tr>
<td colspan="11" align="left" valign="middle">Programme Name: <?php echo $qr_fetch_sem_result_ans['programme_name'];?></td>
</tr>
<tr>
<td colspan="11" align="center" valign="middle"><table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="27%">Seat No.: <?php echo $qr_fetch_sem_result_ans['seatNo'];?></td>
<td width="3%"> </td>
<td width="22%">PR No. : <?php echo $qr_fetch_sem_result_ans['prn'];?></td>
<td width="2%"> </td>
<td width="17%">Semester : <?php echo $qr_fetch_sem_result_ans['semisterName'];?></td>
<td width="1%"> </td>
<td width="25%">Month / Year Of Exam : <?php echo $qr_fetch_sem_result_ans['month_year_of_exam'];?> </td>
<td width="3%"> </td>
</tr>
<tr>
<td colspan="3">Name: <?php echo $qr_fetch_sem_result_ans['student_name'];?></td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td colspan="7">College / Institute: <?php echo $qr_fetch_sem_result_ans['institute_name'];?></td>
<td> </td>
</tr>
</table></td>
</tr>

I'm going to go out on a limb here and suggest that you run your queries first and then build your PDF file. If you run the queries after you build the PDF, then of course it will not have access to your data. If that doesn't help, then I must not understand what you're asking.
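
The "data first, document second" order can be sketched as follows. This is a Python illustration under assumptions, not TCPDF code: `render_result_html` is a hypothetical helper, the dictionary stands in for the fetched database row, and the point is only that the markup is a finished string, with every value already filled in, before it is handed to a PDF writer.

```python
from html import escape
from string import Template

# Template for part of the result sheet; field names follow the question's columns.
RESULT_ROWS = Template(
    '<tr><td colspan="11">Programme Name: $programme_name</td></tr>\n'
    "<tr><td>Seat No.: $seatNo</td><td>PR No. : $prn</td></tr>"
)


def render_result_html(record):
    """Fill the template from an already-fetched record and return plain HTML."""
    # Escape each value so stray < or & in the data cannot break the markup.
    safe = {key: escape(str(value)) for key, value in record.items()}
    return RESULT_ROWS.substitute(safe)


if __name__ == "__main__":
    # Stand-in for the row returned by the query.
    row = {"programme_name": "B.Sc", "seatNo": 12, "prn": "A<1"}
    print(render_result_html(row))
```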
