PHP DOM get element which contains - php

I need help parsing HTML with PHP DOM.
This is a small part of a much larger HTML document:
<table width="100%" border="0" align="center" cellspacing="3" cellpadding="0" bgcolor='#ffffff'>
<tr>
<td align="left" valign="top" width="20%">
<span class="tl">Obchodne meno:</span>
</td>
<td align="left" width="80%">
<table width="100%" border="0">
<tr>
<td width="67%">
<span class='ra'>STORE BUSSINES</span>
</td>
<td width="33%" valign='top'>
<span class='ra'>(od: 02.10.2012)</span>
</td>
</tr>
</table>
</td>
</tr>
</table>
What I need is to get the text "STORE BUSSINES". Unfortunately, the only thing I can match on is "Obchodne meno", the content of the first tag, so starting from that element I need to get its parent->parent->first sibling->child->child->child->child->content. I have limited experience parsing HTML in PHP, so any help will be valuable. Thanks in advance!

Use the DOMDocument class: loop through the <span> tags and collect their text in an array.
<?php
$html=<<<XCOE
<table width="100%" border="0" align="center" cellspacing="3" cellpadding="0" bgcolor='#ffffff'>
<tr>
<td align="left" valign="top" width="20%">
<span class="tl">Obchodne meno:</span>
</td>
<td align="left" width="80%">
<table width="100%" border="0">
<tr>
<td width="67%">
<span class='ra'>STORE BUSSINES</span>
</td>
<td width="33%" valign='top'>
<span class='ra'>(od: 02.10.2012)</span>
</td>
</tr>
</table>
</td>
</tr>
</table>
XCOE;
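$dom = new DOMDocument;
$dom->loadHTML($html);
// Collect the text of every <span> in document order
$spanarr = [];
foreach ($dom->getElementsByTagName('span') as $tag) {
    $spanarr[] = $tag->nodeValue;
}
echo $spanarr[1]; // prints "STORE BUSSINES" (index 0 is "Obchodne meno:")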
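If the position of the span can change on the real page, an alternative is to anchor on the label text with DOMXPath instead of a fixed array index. This is only a minimal sketch: it assumes the label span always sits in a td whose next sibling td holds the value, and the trimmed-down markup and the variable names ($query, $spans, $name) are made up for illustration.

```php
<?php
// Trimmed-down version of the markup from the question (illustrative only)
$html = <<<'HTML'
<table>
<tr>
<td><span class="tl">Obchodne meno:</span></td>
<td>
<table>
<tr>
<td><span class='ra'>STORE BUSSINES</span></td>
<td><span class='ra'>(od: 02.10.2012)</span></td>
</tr>
</table>
</td>
</tr>
</table>
HTML;

// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Find the td that holds the label span, step to the next td,
// and take the spans below it in document order.
$query = "//td[span[contains(., 'Obchodne meno')]]/following-sibling::td[1]//span";
$spans = $xpath->query($query);

// The first matching span is the value; null if the label was not found
$name = $spans->length > 0 ? trim($spans->item(0)->textContent) : null;
echo $name, PHP_EOL;
```

The second span in the result (item(1)) would be the "(od: ...)" date, so the same query covers both values.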

Related

HTML Purifier for webmail

I'm working on a small webmail client. To embed HTML safely I want to use HTML Purifier (by the way, is that a good idea?).
I tested it with several emails and ran into some problems. One email (from Google) contains something like this:
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="4%">
<td width="92%" style="padding-top:18px; padding-bottom:10px; opacity:0.7">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tbody>
<td width="30%">
<img style="display:inline-block;" height="26" src="https://www.gstatic.com/local/guides/email/images/photo-impact/googlelogo_light_clr-f040d5d9.png">
<td>
<td width="70%" style="text-align:right">
</td>
</tbody>
</table>
HTML Purifier converts it to:
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="4%">
</td><td width="92%" style="padding-top:18px;padding-bottom:10px;opacity:.7;">
</td><td width="30%">
<img style="display:inline-block;" height="26" src="https://www.gstatic.com/local/guides/email/images/photo-impact/googlelogo_light_clr-f040d5d9.png" alt="googlelogo_light_clr-f040d5d9.png">
</td><td>
</td><td width="70%" style="text-align:right;">
</td>
</tr></table>
I don't know why it removes the second <table> tag (it also closes the wrong <td> and removes <tbody>). Is it possible to configure HTML Purifier so that it handles these situations?

Simple Dom Parser or CURL TABLE PARSING

I need help getting the data out of a table. It's an internet usage table, and the HTML is below:
<table width="572" border="0" align="center" cellspacing="0">
<tbody><tr valign="top">
<td width="1" class="bgsidelines"></td>
<td width="*" class="bgbottom">
<table summary="" width="100%" border="0" cellpadding="0">
<tbody><tr>
<td width="10" rowspan="2" bgcolor="#CCCCCC"></td>
<td width="443">
<table width="443" height="10" border="0" align="center" cellpadding="8">
<tbody>
<tr>
<td width="100%" class="path"><b>Internet usage</b></td>
</tr>
<tr>
<td class="reg"><!-- Begin yours codes -->
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<tbody><tr>
<table cellpadding="5" cellspacing="1" border="0">
<tbody>
<tr>
<td width="43" bgcolor="#EEEEEE" class="grey"><b><center>MB</center></b>
</td>
<td width="44" bgcolor="#EEEEEE" class="grey"><b><center>GB</center></b>
</td>
<td width="44" bgcolor="#EEEEEE" class="grey"><b><center>MB</center></b>
</td>
<td width="44" bgcolor="#EEEEEE" class="grey"><b><center>GB</center></b>
</td>
<td width="60" bgcolor="#EEEEEE" class="grey"><b><center>MB</center></b>
</td>
<td width="60" bgcolor="#EEEEEE" class="grey"><b><center>GB</center></b>
</td>
</tr>
<tr>
<td bgcolor="#FFFFFF" class="reg" nowrap="nowrap">2017-06-01 to<br>2017-06-18</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">54815.06</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">53.53</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">52114.59</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">50.89</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">106929.65</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">104.42</td>
</tr>
</tbody></table></td></tr>
</tbody></table>
<!-- End yours codes -->
</tr>
</tbody></table></td></tr>
</tbody></table></td></tr>
</tbody></table>
What I have works, but only sometimes, which I suspect is due to the user agent. It also fetches the entire table, while I would like each of the internet usage values separately, the ones in the td class="reg" cells (54815.06, 53.53, ...). It's hard because there is a table inside a table. Also it's
My PHP:
require_once 'advanced_html_dom.php';
$numvl = $_POST['numvl'];
$url = 'https://extranet.videotron.com/services/secur/extranet/tpia/Usage.do?compteInternet=' . $numvl;
$html = new AdvancedHtmlDom();
$html->load_file($url);
$element = $html->find("tr");
echo $element[1]->innertext;
There's no need for an external library (advanced_html_dom.php? I've never heard of it); just use PHP's built-in DOMDocument and DOMXPath.
Example:
<?php
declare(strict_types=1);
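$domd = new DOMDocument();
@$domd->loadHTML(getHTML()); // @ suppresses warnings about the malformed markup
$xpath = new DOMXPath($domd);
// Only the value cells carry both valign="top" and class="reg"
foreach ($xpath->query("//td[@valign='top' and @class='reg']") as $ele) {
    var_dump($ele->textContent);
}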
function getHTML():string{
$html=<<<'HTML'
<table width="572" border="0" align="center" cellspacing="0">
<tbody><tr valign="top">
<td width="1" class="bgsidelines"></td>
<td width="*" class="bgbottom">
<table summary="" width="100%" border="0" cellpadding="0">
<tbody><tr>
<td width="10" rowspan="2" bgcolor="#CCCCCC"></td>
<td width="443">
<table width="443" height="10" border="0" align="center" cellpadding="8">
<tbody>
<tr>
<td width="100%" class="path"><b>Internet usage</b></td>
</tr>
<tr>
<td class="reg"><!-- Begin yours codes -->
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<tbody><tr>
<table cellpadding="5" cellspacing="1" border="0">
<tbody>
<tr>
<td width="43" bgcolor="#EEEEEE" class="grey"><b><center>MB</center></b>
</td>
<td width="44" bgcolor="#EEEEEE" class="grey"><b><center>GB</center></b>
</td>
<td width="44" bgcolor="#EEEEEE" class="grey"><b><center>MB</center></b>
</td>
<td width="44" bgcolor="#EEEEEE" class="grey"><b><center>GB</center></b>
</td>
<td width="60" bgcolor="#EEEEEE" class="grey"><b><center>MB</center></b>
</td>
<td width="60" bgcolor="#EEEEEE" class="grey"><b><center>GB</center></b>
</td>
</tr>
<tr>
<td bgcolor="#FFFFFF" class="reg" nowrap="nowrap">2017-06-01 to<br>2017-06-18</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">54815.06</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">53.53</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">52114.59</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">50.89</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">106929.65</td>
<td bgcolor="#FFFFFF" align="right" valign="top" class="reg">104.42</td>
</tr>
</tbody></table></td></tr>
</tbody></table>
<!-- End yours codes -->
</tr>
</tbody></table></td></tr>
</tbody></table></td></tr>
</tbody></table>
HTML;
return $html;
}
output:
string(8) "54815.06"
string(5) "53.53"
string(8) "52114.59"
string(5) "50.89"
string(9) "106929.65"
string(6) "104.42"
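If the values are needed as numbers rather than strings, the same query can feed an array. This is a rough sketch under some assumptions: the cells are taken to alternate MB, GB as the header row suggests, the shortened table and the names $values and $pairs are made up for illustration, and libxml_use_internal_errors() is used in place of the @ operator.

```php
<?php
// Shortened stand-in for the usage table (illustrative only)
$html = <<<'HTML'
<table>
<tr><td class="grey">MB</td><td class="grey">GB</td><td class="grey">MB</td><td class="grey">GB</td></tr>
<tr>
<td class="reg">2017-06-01 to 2017-06-18</td>
<td align="right" valign="top" class="reg">54815.06</td>
<td align="right" valign="top" class="reg">53.53</td>
<td align="right" valign="top" class="reg">52114.59</td>
<td align="right" valign="top" class="reg">50.89</td>
</tr>
</table>
HTML;

// Collect parser warnings internally instead of suppressing them with @
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$values = [];
foreach ($xpath->query("//td[@valign='top' and @class='reg']") as $cell) {
    // Cast each cell's text to a float
    $values[] = (float) trim($cell->textContent);
}

// Group consecutive values into [MB, GB] pairs (assumed column order)
$pairs = array_chunk($values, 2);
print_r($pairs);
```

The date cell is skipped because it has no valign="top" attribute, so only the numeric cells end up in $pairs.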

Python regex ignore new line

I have a web page that looks like this:
<td valign="top">
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
<tr>
<td colspan="2">
<div align="center">
<a href="/title/name.php" target="_blank">
<img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
</a>
</div>
</td>
</tr>
<tr>
<td colspan="2"><h1 align="center">Title - secondname</h1></td>
</tr>
<tr>
<td><span class="style10">Cat1 :</span></td>
<td>1st name</td>
</tr>
<tr>
<td width="32%"><span class="style10">Cat2 :</span></td>
<td width="68%"><b><i>secondname</i></b></td>
</tr>
<tr>
<td><span class="style10">cat4 :</span></td>
<td>Bla bla</td>
</tr>
<tr>
<td><span class="style10">Cat3 :</span></td>
<td>thirdName2</td>
</tr>
</table>
</td>
<td valign="top">
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
<tr>
<td colspan="2">
<div align="center">
<a href="/title/name.php" target="_blank">
<img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
</a>
</div>
</td>
</tr>
<tr>
<td colspan="2"><h1 align="center">Title - secondname</h1></td>
</tr>
<tr>
<td><span class="style10">Cat1 :</span></td>
<td>1st name</td>
</tr>
<tr>
<td width="32%"><span class="style10">Cat2 :</span></td>
<td width="68%"><b><i>secondname</i></b></td>
</tr>
<tr>
<td><span class="style10">cat4 :</span></td>
<td>Bla bla</td>
</tr>
<tr>
<td><span class="style10">Cat3 :</span></td>
<td>thirdName2</td>
</tr>
</table>
</td>
I would like to get certain values from this page using a Python regex.
After <div align="center"> I would like to get the href value "/title/name.php", the img src "./movie/image.jpg", and "Title - secondname" from <h1 align="center">Title - secondname</h1>.
I have tried this:
regex = 'class="main_tb3"*\n<a href="(.+?)" target="_blank">\n<img src="(.+?)"'
Please help.
You can use the regexes below:
For href value: <a href="(.*?)"
For Image src: <img src="(.*?)"
For Title: <h1 align="center">(.*?)<
You will find it a lot simpler to install something like BeautifulSoup to do this:
from bs4 import BeautifulSoup
html = """
<td valign="top">
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
<tr>
<td colspan="2">
<div align="center">
<a href="/title/name.php" target="_blank">
<img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
</a>
</div>
</td>
</tr>
<tr>
<td colspan="2"><h1 align="center">Title - secondname</h1></td>
</tr>
<tr>
<td><span class="style10">Cat1 :</span></td>
<td>1st name</td>
</tr>
<tr>
<td width="32%"><span class="style10">Cat2 :</span></td>
<td width="68%"><b><i>secondname</i></b></td>
</tr>
<tr>
<td><span class="style10">cat4 :</span></td>
<td>Bla bla</td>
</tr>
<tr>
<td><span class="style10">Cat3 :</span></td>
<td>thirdName2</td>
</tr>
</table>
</td>
<td valign="top">
<table width="100%" border="0" cellspacing="2" cellpadding="1" class="main_tb3">
<tr>
<td colspan="2">
<div align="center">
<a href="/title/name.php" target="_blank">
<img src="./movie/image.jpg" alt="TitleName" border="0" height="100" width="225" />
</a>
</div>
</td>
</tr>
<tr>
<td colspan="2"><h1 align="center">Title - secondname</h1></td>
</tr>
<tr>
<td><span class="style10">Cat1 :</span></td>
<td>1st name</td>
</tr>
<tr>
<td width="32%"><span class="style10">Cat2 :</span></td>
<td width="68%"><b><i>secondname</i></b></td>
</tr>
<tr>
<td><span class="style10">cat4 :</span></td>
<td>Bla bla</td>
</tr>
<tr>
<td><span class="style10">Cat3 :</span></td>
<td>thirdName2</td>
</tr>
</table>
</td>"""
soup = BeautifulSoup(html, "html.parser")
for table in soup.find_all("table", class_="main_tb3"):
    print(table.find('a').get('href'))
    print(table.find('img').get('src'))
    print(table.find('h1').text)
For the HTML you have given, this will print the following:
/title/name.php
./movie/image.jpg
Title - secondname
/title/name.php
./movie/image.jpg
Title - secondname

Xpath nested tables

I have a table (see the code below) that contains other tables, so it is nested. I want to get all values of the parent table only, and then all values of the child tables.
To get the child data I can do this:
$query = '//*[@id="WordClass"]/table[2]/tr/td[2]/table/tr';
$nodes = $xpath->query($query);
foreach ($nodes as $node) {
    // run further queries to get the td data and save it
}
My problem is how to get only the data of the parent table, without also getting the child tables' tr/td data.
<table cellpadding="0" cellspacing="0" border="0">
<tbody>
<tr valign="top">
<td>
<table cellpadding="1" cellspacing="2" border="0">
<tr>
<td class="colTitle" align="center" colspan="4">
Da Titel
</td>
</tr>
<tr>
<td class="colTitle" align="center" colspan="2">One
</td>
<td class="colTitle" align="center" colspan="2">Two
I
</td>
</tr>
<tr>
<td class="colSubTitle">Pe</td>
<td class="colSubTitle">Ve</td>
<td class="colSubTitle">Pe</td>
<td class="colSubTitle">Ve</td>
</tr>
<tr>
<td class="rowTitle">x</td>
<td class="colVerbDef">y</td>
<td class="rowTitle">z</td>
<td class="colVerbDef">c</td>
</tr>
<tr>
<td class="rowTitle">r</td>
<td class="colVerbDef">t</td>
<td class="rowTitle">z</td>
<td class="colVerbDef">z</td>
</tr>
</table>
</td>
<td>
<table cellpadding="1" cellspacing="2" border="0">
<tr>
<td class="colTitle" align="center" colspan="4">
Da Titel2
</td>
</tr>
<tr>
<td class="colTitle" align="center" colspan="2">One
</td>
<td class="colTitle" align="center" colspan="2">Two
I
</td>
</tr>
<tr>
<td class="colSubTitle">Pe2</td>
<td class="colSubTitle">Ve2</td>
<td class="colSubTitle">Pe2</td>
<td class="colSubTitle">Ve2</td>
</tr>
<tr>
<td class="rowTitle">x2</td>
<td class="colVerbDef">y2</td>
<td class="rowTitle">z2</td>
<td class="colVerbDef">c2</td>
</tr>
<tr>
<td class="rowTitle">r2</td>
<td class="colVerbDef">t2</td>
<td class="rowTitle">z2</td>
<td class="colVerbDef">z2</td>
</tr>
</table>
</td>
</tr>
</tbody>
</table>
You can get the contents of the parent table's td elements using a direct path from the root:
/table/tbody/tr/td
The contents of those cells happen to be another table element, but you can strip those out with DOMDocument.
To get only the inner tables' td elements, excluding the parent's, look for tables that have a td parent and then select their td descendants:
//td/table//td
If I've misunderstood your question, please feel free to explain further and I will update.
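The two expressions can be tried out with DOMXPath. This is only a sketch on a cut-down copy of the table. One assumption to note: DOMDocument::loadHTML() wraps a fragment in html and body elements, so instead of a path starting at /table the outer cells are selected here with the predicate not(ancestor::table), which is a substitution made for this example. The variable names are illustrative.

```php
<?php
// Cut-down version of the nested table from the question (illustrative only)
$html = <<<'HTML'
<table>
<tbody>
<tr>
<td><table><tr><td class="rowTitle">x</td><td class="colVerbDef">y</td></tr></table></td>
<td><table><tr><td class="rowTitle">x2</td><td class="colVerbDef">y2</td></tr></table></td>
</tr>
</tbody>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Cells of the outer table only: start from tables that have no table ancestor
$outerCells = $xpath->query('//table[not(ancestor::table)]/tbody/tr/td');

// Cells of the inner tables only: tables whose parent is a td
$innerTexts = [];
foreach ($xpath->query('//td/table//td') as $cell) {
    $innerTexts[] = trim($cell->textContent);
}

echo $outerCells->length, PHP_EOL;
echo implode(', ', $innerTexts), PHP_EOL;
```

Each outer cell still contains its inner table as a child node, so reading textContent on an outer cell would return the inner text as well; the inner query is the one to use for the actual values.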

Php HTML DOM parsing

<table width="100%" cellspacing="0" cellpadding="0" border="0" id="Table4">
<tbody>
<tr>
<td valign="top" class="tx-strong-dgrey">
<a class="anc-noul" href="http://www.example.com/catalog/proddetail.asp?logon=&langid=EN&sku_id=0665000FS10129471&catid=25653">
Apple 8GB 3rd Generation iPod Touch</a></td>
</tr>
<tr>
<td valign="top" class="element-spacer"/>
</tr>
<tr>
<td valign="top" class="tx-normal-grey">
Product detail
<a href="http://www.example.com/catalog/proddetail.asp?logon=&langid=EN&sku_id=0665000FS10129471&catid=25653">
More Info</a></td>
</tr>
<tr>
<td valign="top" class="element-spacer"/>
</tr>
<tr>
<td valign="top" class="tx-normal-red">
<span class="tx-strong-dgrey">Price:</span>
$189.99</td>
</tr>
<tr>
<td valign="top">You save: $9.00 after instant savings</td>
</tr>
<tr>
<td valign="top" class="element-spacer"/>
</tr>
<tr>
<td valign="top" class="tx-normal-grey">
<a href="http://www.example.com/catalog/subclass.asp?catid=25653&logon=&langid=EN">
View similar products</a>
<a href="http://www.example.com/catalog/mfr.asp?man=Apple&catid=19&logon=&langid=EN">
View similar products with same brand</a>
</td></tr>
<tr>
<td valign="top" class="element-spacer"/>
</tr>
</tbody>
</table>
I want to be able to get the $189.99.
echo $ret[0]->find('tr', 4)->plaintext;
This outputs: 'Price: $189.99'
I just need $189.99, not 'Price:'
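// Split on the colon and keep the part after "Price:"
$exp = explode(":", $ret[0]->find('tr', 4)->plaintext);
$price = trim($exp[1]); // trim() removes the surrounding whitespace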
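An alternative that does not depend on the row index or on splitting strings is to select only the text node that follows the span. This is a sketch using PHP's built-in DOMDocument and DOMXPath rather than the library used in the question. It assumes the price cell is the only td with class tx-normal-red, and the reduced markup and the names $nodes and $price are illustrative.

```php
<?php
// Reduced version of the price row from the question (illustrative only).
// A nowdoc is used so that "$189.99" is never treated as a PHP variable.
$html = <<<'HTML'
<table id="Table4">
<tr>
<td valign="top" class="tx-normal-red">
<span class="tx-strong-dgrey">Price:</span>
$189.99</td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Direct text children of the price cell, skipping whitespace-only nodes;
// the "Price:" label lives inside the span, so it is not selected.
$nodes = $xpath->query("//td[@class='tx-normal-red']/text()[normalize-space()]");
$price = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
echo $price, PHP_EOL;
```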
