I don't mean to be a bother, and I know this has been asked a thousand times before, but I'm just not understanding the concept. I was wondering if somebody could walk me through it. Here is what I'm trying to do:
I have a set of information inside an HTML file. The file is uploaded to the server and I need to parse information out of the file within set parameters (demo code to follow). I have been reading about parsing for over a week and understand some of it, but I'm just not grasping the concept. I guess I just need somebody to work through this demo for me, and if you could, please break down the search variables. Here's the demo:
<hr>
<a id="Operating_System"></a>
<table WIDTH="100%" BORDER="0" CELLSPACING="0" ALIGN="CENTER">
<CAPTION ALIGN="TOP"><FONT size="5">Operating System</FONT></CAPTION>
<tr><td>Top</td></tr>
<TR ALIGN="LEFT" BGCOLOR="#00FF00">
<TH>Property</TH>
<TH>Value</TH>
</TR>
<TR BGCOLOR="#F0F0F0">
<TD>Name</TD>
<TD>Windows 7 Professional x64 Service Pack 1</TD>
</TR>
<TR>
<TD>Features</TD>
<TD>Terminal Services in Remote Admin Mode, 64 Bit Edition, Media Center Edition, Multiprocessor Free</TD>
</TR>
<TR BGCOLOR="#F0F0F0">
<TD>Up Time</TD>
<TD>5 Days 22 Hours 4 Minutes 26 seconds</TD>
</TR>
<!-- Operating System Duration: 1.853 seconds -->
</table>
<hr>
<a id="Installed_Updates"></a>
<table WIDTH="100%" BORDER="0" CELLSPACING="0" ALIGN="CENTER">
<CAPTION ALIGN="TOP"><FONT size="5">Installed Updates</FONT></CAPTION>
And here is what I'm trying to accomplish. On this demo, I need the information parsed, but only certain information should come back. There is a lot more information here, but I only need about 30 things in total from each document. First I need to search from Operating_System to Installed_Updates; this gives me the first area I need to gather information from (there are other groups too, so I'll make one search for each group of information). Then I need to make the search more specific, such as from <TR> to </TR>, which gives me the actual information set I need. After that I just grab the first 'name' and 'value' to store in a database.
Again, I know it's out there, but I'm just not getting the whole concept of regular expressions. After I do it a few times on an actual document, I think I'll get the hang of it.
Thank you all so much for the help, I really appreciate it.
A regex only works for fixed HTML with little variation. But if you just want a simple example, here is one:
preg_match('#<TD>Up Time</TD>.*?<TD>([\w ]+)</TD>#is', $html, $match);
print $match[1]; // the text captured by ([\w ]+), i.e. the uptime value
See also https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world for some tools. And http://regular-expressions.info/ to learn the syntax.
But as said, if you want to extract a lot of values, there are easier options.
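To make the "easier options" concrete, here is a minimal sketch of the parser-based route. It is written in Python with only the standard library, not PHP, and the class name SectionTableParser is made up for this example. It assumes every group starts with an <a id="..."> anchor and that the interesting rows have exactly two <td> cells, as in the demo above.

```python
from html.parser import HTMLParser


class SectionTableParser(HTMLParser):
    """Collect property/value pairs from two-cell rows, grouped by the
    most recent <a id="..."> anchor (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.sections = {}    # anchor id -> {property: value}
        self._section = None  # id of the most recent <a id="...">
        self._row = None      # cell texts of the <tr> being read
        self._cell = None     # text fragments of the <td> being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> and <tr> look the same here.
        attrs = dict(attrs)
        if tag == "a" and "id" in attrs:
            self._section = attrs["id"]
            self.sections.setdefault(self._section, {})
        elif tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Header rows use <th> and the "Top" row has one cell: both are skipped.
            if len(self._row) == 2 and self._section is not None:
                name, value = self._row
                self.sections[self._section][name] = value
            self._row = None
            self._cell = None


sample = """
<hr>
<a id="Operating_System"></a>
<table WIDTH="100%" BORDER="0" CELLSPACING="0" ALIGN="CENTER">
<CAPTION ALIGN="TOP"><FONT size="5">Operating System</FONT></CAPTION>
<tr><td>Top</td></tr>
<TR ALIGN="LEFT" BGCOLOR="#00FF00">
<TH>Property</TH>
<TH>Value</TH>
</TR>
<TR BGCOLOR="#F0F0F0">
<TD>Name</TD>
<TD>Windows 7 Professional x64 Service Pack 1</TD>
</TR>
<TR BGCOLOR="#F0F0F0">
<TD>Up Time</TD>
<TD>5 Days 22 Hours 4 Minutes 26 seconds</TD>
</TR>
<!-- Operating System Duration: 1.853 seconds -->
</table>
<hr>
<a id="Installed_Updates"></a>
"""

parser = SectionTableParser()
parser.feed(sample)
os_info = parser.sections["Operating_System"]
print(os_info["Name"])
print(os_info["Up Time"])
```

Each group ends up as its own dictionary, so picking the 30 or so values you care about is just a matter of looking up keys. The same idea should carry over to PHP's DOM extension.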
I'm using Html Agility Pack to run xpath queries on a web page. I want to find the rows in a table which contain a certain interesting element. In the example below, I want to fetch the second row.
<table name="important">
<tr>
<td>Stuff I'm NOT interested in</td>
</tr>
<tr>
<td>Stuff I'm interested in</td>
<td><interestingtag/></td>
<td>More stuff I'm interested in</td>
</tr>
<tr>
<td>Stuff I'm NOT interested in</td>
</tr>
<tr>
<td>Stuff I'm NOT interested in</td>
</tr>
</table>
I'm looking to do something like this:
//table[@name='important']/tr[has a descendant named interestingtag]
Except with valid xpath syntax. ;-)
I suppose I could just find the interesting element itself and then work my way up the parent chain from the node that's returned, but it seemed like there ought to be a way to do this in one step and I'm just being dense.
"has a descendant named interestingtag" is spelled .//interestingtag in XPath, so the expression you are looking for is:
//table[@name='important']/tr[.//interestingtag]
Actually, you need to look for a descendant, not a child:
//table[@name='important']/tr[descendant::interestingtag]
I know this isn't what the OP was asking, but if you wanted to find an element that had a descendant with a particular attribute, you could do something like this:
//table[@name='important']/tr[.//*[@attr='value']]
I know it is a late answer, but why not go the other way around? Find all <interestingtag/> tags and then select the ancestor <tr> tag:
//interestingtag/ancestor::tr
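For anyone who wants to try the "row that has a descendant" idea without Html Agility Pack, here is a rough Python sketch using the standard library's xml.etree.ElementTree. ElementTree only implements a subset of XPath: it accepts the [@name='important'] attribute predicate but not a [.//interestingtag] predicate or the ancestor axis, so the descendant check is done in plain Python. The sample markup is a shortened copy of the table from the question, wrapped in html/body so that it parses as XML.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<html><body>
<table name="important">
  <tr><td>Stuff I'm NOT interested in</td></tr>
  <tr>
    <td>Stuff I'm interested in</td>
    <td><interestingtag/></td>
    <td>More stuff I'm interested in</td>
  </tr>
  <tr><td>Stuff I'm NOT interested in</td></tr>
</table>
</body></html>
""")

# Select the candidate rows with the XPath subset ElementTree supports,
# then keep only the rows that contain an <interestingtag> at any depth.
rows = [
    tr
    for tr in doc.findall(".//table[@name='important']/tr")
    if tr.find(".//interestingtag") is not None
]

first_cells = [tr.findtext("td") for tr in rows]
print(first_cells)
```

With a full XPath 1.0 engine, the one-line expressions given in the answers above do the same job without the extra filtering step.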
I would like to extract data from a website, whose code is written like this:
...
<tr>
<td class="something1"><a class="whatever" href="#">NAME</a> </td>
<td class="something2">DATA</td>
<td class="something3">NUMERIC DATA</td>
</tr>
...
In particular, I have my NAME list from my MySQL database, and if my NAME is equal to a NAME on this website, I want to print the corresponding NUMERIC DATA on my website.
I know I can do something with php_simple_html_dom but I cannot really achieve this action. Can you please help me?
Thanks!
So you want to read NAME first and, if it is relevant, read the rest? You can fetch a page's HTML as explained in "How do I get the HTML code of a web page in PHP?":
$html = file_get_contents('http://pathToTheWebsite.com/thePage');
Now let's parse the $html with some regex. (You can use that library too; its documentation tells you how to do it.)
preg_match('~<td class="something1"><a class="whatever" href="#">(?<name>[^<]+)</a>\s*</td>~', $html, $matches);
Now $matches['name'] will contain the NAME. You can do the same for the rest, and maybe clean up that regex a little; this was just an example.
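To show how "do the same for the rest" could look, here is a small sketch in Python (standard library only) that captures NAME and NUMERIC DATA from each row in one pass and then looks up the wanted names. The names Alice and Bob, the numbers, and the wanted list are invented sample data standing in for the real page and the MySQL result. It assumes the three cells always appear in this exact order with these exact class attributes.

```python
import re

html = """
<tr>
<td class="something1"><a class="whatever" href="#">Alice</a> </td>
<td class="something2">DATA</td>
<td class="something3">42</td>
</tr>
<tr>
<td class="something1"><a class="whatever" href="#">Bob</a> </td>
<td class="something2">DATA</td>
<td class="something3">17</td>
</tr>
"""

# One pattern per table row: named groups pick out the name and the number.
ROW = re.compile(
    r'<td class="something1"><a class="whatever" href="#">(?P<name>[^<]+)</a>\s*</td>\s*'
    r'<td class="something2">[^<]*</td>\s*'
    r'<td class="something3">(?P<number>[^<]+)</td>'
)

numbers_by_name = {
    m["name"].strip(): m["number"].strip() for m in ROW.finditer(html)
}

wanted = ["Bob", "Carol"]  # stand-in for the names read from the database
for name in wanted:
    print(name, numbers_by_name.get(name, "not found"))
```

Building the dictionary once and then looking names up avoids running a separate regex for every name in the database.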
Here is my code, which mixes PHP and HTML.
(It shows a 4x4 table.)
How can I make this code more readable?
If I change the HTML indentation, I have to change the whole HTML.
(This code is part of a larger source file.)
Do you have any ideas for me? Please let me know. Ta.
<!-- GALLERY BEGIN -->
<tr>
<td style="padding-top:5px;"><table width="100%" border="0" cellpadding="2" cellspacing="0">
<tr>
<?php
$ii = 0;
$aRows = getArticle('board_table', $DB_CONNECT, 8);
foreach ($aRows as $iidx => $aRow) :
$ii++;
$U = getUpfiles('upload_table',$aRow['uid'],'');
$pic = getImage($U);
?>
<td width="75"><table width="60" border="0" cellspacing="0" cellpadding="0">
<tr>
<td><img src="<?=$pic?>"/></td>
</tr>
<tr>
<td align="center"><?=getStrCut($aRow['subject'], 8, '..')?></td>
</tr>
</table></td>
<?php if($ii%4==0): ?>
</tr>
</table></td>
</tr>
<tr>
<td><table width="100%" border="0" cellpadding="2" cellspacing="0">
<tr>
<?php endif; ?>
<?php endforeach; ?>
</tr>
</table></td>
</tr>
<!-- END BEGIN -->
I would suggest using some kind of templating engine, like Smarty. That will allow you to keep your design and data more separate.
Don't even bother about the HTML indentation; it doesn't matter the least bit, neither to your users nor to your browser. For debugging, you should use something like the Developer Tools in Chrome or good ol' Firebug. Those tools indent your HTML for you, no matter what the actual codebase looks like.
Indent the way you can read the code best, and not the way you think is best for the program. As for that, you should really avoid mixing HTML and PHP at all. As bkwint already said, use a templating engine like Smarty or Twig. With them you still have some weird "in-between" code in your HTML, but it will look much cleaner and be easier to understand.
As for your current code: Go with the standard indentation, no matter if you are in a PHP code block or in the HTML parts. That means, indent when opening a loop, an if or a tag. This should at least somehow clean up that table-driven-madness you have going there.
I'm working on a project that requires converting HTML email into text. Below is a simplified version of the HTML code:
<table>
<tr>
<td width="10%"></td>
<td width="60%"> test product </td>
<td width="20%">5</td>
<td width="10%"> £50.00 </td>
</tr>
<tr>
<td></td>
<td colspan="3" width="100%"> Project Name: Test Project </td>
</tr>
<tr>
<td width="10%"> </td>
<td colspan="2" width="80%"> Page 1 : 01 New York 1.jpg </td>
<td width="10%"> £0.00 </td>
</tr>
</table>
The expected outcome should look like this in a text file (with columns aligned nicely):
test product 5 £50.00
Project Name: Test Project
Page 1 : 01 New York 1.jpg £0.00
My idea is to parse the HTML content with DOMDocument. Then I will set a default width for the table (e.g. 100 spaces) and convert the width of each column from % to a number of spaces (based on the colspan and width attributes of the <td> tag). Then I will subtract the strlen of the data in each column from its column width to get the number of spaces I need to right-pad the string with so that everything aligns vertically.
I have been working that way but haven't achieved what I want yet. I'm just wondering whether it is a stupid approach; if anyone knows a better way, please help me out.
Also, when it comes to multibyte languages (Japanese, Korean, etc.), I don't think my approach would work, because their characters are wider than one space and it would end up a mess.
Can someone help me out please?
Don't reinvent the wheel. Table rendering is difficult, rendering tables using only text is even more difficult.
To get a sense of how complex a text-based table renderer with all the features of HTML is, take a look at the table code of w3m, which is open source: roughly 3000 lines of code are there only to display HTML tables.
Transform HTML to Text
There are textbased browsers that can be used by command line, like lynx.
You could fwrite your html table into a file, pass that file into the textbased browser and take its output.
Note: textbased browsers are generally used in a shell, which generally displays in monospace. This remains a prerequisite.
lynx and w3m are both available on Windows and you don't need to "install" them, you just need to have the executables and the permission to run them from PHP.
code example:
<?php
$table = '<table><tr><td>foo</td><td>bar</td></tr></table>'; //this contains your table
$html = "<html><body>$table</body></html>";
//write html file
$tmpfname = tempnam(sys_get_temp_dir(), "tblemail");
$handle = fopen($tmpfname, "w");
fwrite($handle, $html);
fclose($handle);
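// -T text/html: the temp file has no .html extension, so tell w3m the content type
$myTextTable = shell_exec('w3m.exe -T text/html -dump ' . escapeshellarg($tmpfname));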
unlink($tmpfname);
w3m.exe needs to be in your working directory.
(didn't try it)
Render a Text table
If you want a native PHP solution, there's also at least one framework (https://github.com/c9s/CLIFramework) aimed at console applications for PHP which has a table renderer.
It doesn't transform HTML to text, but it helps you build a text formatted table with support for multiline cells (which seems to be the most complicated part).
Using CLIFramework you would need code like this to render your table:
<?php
require 'vendor/autoload.php';
use CLIFramework\Component\Table\Table;
$table = new Table;
$table->addRow(array(
"test product", "5", "£50.00"
));
$table->addRow(array(
"Project Name: Test Project", "", ""
));
$table->addRow(array(
"Page 1 : 01 New York 1.jpg", "", "£0.00"
));
$myTextTable = $table->render();
The CLIFramework table renderer doesn't seem to support anything similar to "colspan" however.
Here's the documentation for the table component: https://github.com/c9s/CLIFramework/wiki/Using-Table-Component
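On the multibyte worry from the question: the padding approach can still work if you pad by display width instead of byte or character count. Below is a minimal sketch in Python (standard library only) of that idea. The function names and the column widths [4, 30, 8, 10] are invented for this example; it counts East Asian wide and fullwidth characters as two columns and everything else as one, which ignores combining characters and only lines up in a monospace font.

```python
import unicodedata


def display_width(text):
    """Columns used in a monospace terminal: wide/fullwidth characters count as 2."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in text
    )


def pad(text, width):
    """Right-pad text with spaces up to the given display width."""
    return text + " " * max(0, width - display_width(text))


def render(rows, widths):
    """rows: lists of (text, colspan) cells; widths: width of each table column."""
    lines = []
    for row in rows:
        col = 0
        parts = []
        for text, span in row:
            # A cell with colspan N gets the combined width of N columns.
            parts.append(pad(text.strip(), sum(widths[col:col + span])))
            col += span
        lines.append("".join(parts).rstrip())
    return "\n".join(lines)


widths = [4, 30, 8, 10]  # assumed character widths for the four columns
rows = [
    [("", 1), (" test product ", 1), ("5", 1), (" £50.00 ", 1)],
    [("", 1), (" Project Name: Test Project ", 3)],
    [(" ", 1), (" Page 1 : 01 New York 1.jpg ", 2), (" £0.00 ", 1)],
]

output = render(rows, widths)
print(output)
```

Colspan is handled by summing the widths of the spanned columns, which is the part the CLIFramework renderer does not seem to cover. Cells longer than their column are not wrapped or truncated in this sketch.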
I have no clue at all.
How do I extract the numeric % data on the right from the link below and display them on my website without updating daily myself? Can a simple PHP + HTML solve my problem?
http://www.mrrebates.com/merchants/all_merchants.asp
Meanwhile, how do I automatically hyperlink the extracted numeric % and display it as a link for that retailer? for example,
1 Stop Florists------------------------- 8% (this 8% should be displayed as hyperlink for that retailer, unfortunately I am too new to have more than 1 hyperlink)
at the same time integrating my referral id (shown below) on to that 8% hyperlink
mrrebates.com?refid=420149
You can use curl to download the page, then use regular expressions to parse it up and print it out in whatever form you want. Here's some PHP code to do it:
<?php
system("curl -v http://www.mrrebates.com/merchants/all_merchants.asp > /tmp/x.txt");
$data = file_get_contents("/tmp/x.txt");
preg_match_all('/<td><a href="([^"]*)".*?<b>([^<]*)<\/b>.*?<td class="r">([^<]*)<\/td>/',
$data, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$site_name = $match[2];
$url = "http://www.mrrebates.com/{$match[1]}";
$percent = $match[3];
print "<a href='$url'>$site_name</a> ";
print "<a href='$url'>$percent</a> <br/>";
}
That'll print out a list of links every time you refresh the page. I have no idea how referral codes work on that site, but I imagine it'll be pretty easy to tack it onto the $url variable.
One caveat here is that every time you refresh your page, it's going to have to load the other site first and parse it so it'll be slow. You could separate out the system("curl...") call into a separate file and only do that once an hour or so if you want to make it go faster. Good luck.
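The "only do that once an hour" part could be sketched like this, in Python with the standard library only. The helper name cached_fetch, the cache file name, and the fake_download function are all made up for this example; a real version would pass in a function that actually downloads the page.

```python
import os
import tempfile
import time


def cached_fetch(cache_path, fetch, max_age_seconds=3600):
    """Return the cached text if the cache file is younger than max_age_seconds,
    otherwise call fetch(), store its result in the cache file, and return it."""
    try:
        fresh = time.time() - os.path.getmtime(cache_path) < max_age_seconds
    except OSError:
        fresh = False  # no cache file yet
    if not fresh:
        data = fetch()
        with open(cache_path, "w", encoding="utf-8") as f:
            f.write(data)
        return data
    with open(cache_path, encoding="utf-8") as f:
        return f.read()


calls = []


def fake_download():
    # Stand-in for the real download; records how often it is called.
    calls.append(1)
    return '<td class="r">8%</td>'


cache_file = os.path.join(tempfile.mkdtemp(), "all_merchants.html")
first = cached_fetch(cache_file, fake_download)
second = cached_fetch(cache_file, fake_download)
print(len(calls))
```

The second call is served from the file, so the remote site is only hit when the cache is missing or older than an hour.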
Parsing XHTML is best left to a DOM parser. However, this type of scrape operation is messy business anyway. I will propose another solution and let you piece it together.
View the source of your HTML and find out the beginning and end of your table. Looks like you want this:
<table border="0" width="95%" cellpadding="3" cellspacing="0" style="border: 1px dotted #808080;">
<tr>
<td bgcolor="#FFCC00"><b>Store Name</b></td>
<td width="75" align="center" bgcolor="#FFCC00"><b>Coupons</b></td>
<td width="75" align="right" bgcolor="#FFCC00"><b>Rebate</b></td>
</tr>
And then look for the next occurrence of </table>.
Now, your content is in rows... look for <tr and </tr>.
I'll let you figure it out how to break it down from there.
Now, to actually do all of this work... there are lots of functions that can help you. Start with strpos.
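The find-the-markers approach described above can be sketched as follows, in Python with str.find playing the role of strpos. The helper name between and the sample page are invented for this example ("Example Store" and its 3% are made up); it assumes tables are not nested and that the tag names are lowercase, as in the snippet above.

```python
def between(text, start_marker, end_marker, pos=0):
    """Return (substring, next_pos) for the text between two markers,
    searching from pos, or (None, -1) if either marker is missing."""
    start = text.find(start_marker, pos)
    if start == -1:
        return None, -1
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None, -1
    return text[start:end], end + len(end_marker)


page = """
<p>intro</p>
<table border="0" width="95%">
<tr><td bgcolor="#FFCC00"><b>Store Name</b></td><td align="right"><b>Rebate</b></td></tr>
<tr><td>1 Stop Florists</td><td class="r">8%</td></tr>
<tr><td>Example Store</td><td class="r">3%</td></tr>
</table>
"""

table, _ = between(page, "<table", "</table>")

rows = []
pos = 0
while True:
    row, pos = between(table, "<tr", "</tr>", pos)
    if row is None:
        break
    cells = []
    cell_pos = 0
    while True:
        cell, cell_pos = between(row, "<td", "</td>", cell_pos)
        if cell is None:
            break
        # Drop the rest of the opening tag (attributes and the closing ">").
        cells.append(cell.partition(">")[2].strip())
    rows.append(cells)

data_rows = rows[1:]  # skip the header row
print(data_rows)
```

This is brittle in the same way the regex is: any change to the page's markup can break it, which is why a DOM parser is usually the safer choice.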
This is probably better done with JavaScript (or at least I have usually tackled problems like this on the client side), particularly with the jQuery library.
You want to load the data on that page with something like
$.get("http://www.mrrebates.com/merchants/all_merchants.asp");
and parse the returned data to get the info you need (this should be simple enough that jQuery will do, though there are fuller DOM parsers). I'm not sure what you're familiar with so far, but it would probably be a lot to describe here. I see the % info is in a td with class "r".
Do you have just one referral ID, or one for each vendor? That will obviously matter.