How can I read the ISP value from an HTML table?
<table style="padding-top:10px;">
<tbody>
<tr>
<th>ISP:</th>
<td>My Provider</td>
</tr>
<tr><th>Organization:</th><td nowrap=""></td>
</tr>
<tr><th>Connection:</th>
</tbody></table>
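Given the lack of information, a regular expression would be the easiest solution. Note that the pattern needs delimiters (~ here); without them preg_match fails with a warning.
$matches = array();
preg_match('~<th>ISP:</th>\s*<td>(.*?)</td>~s', '<th>ISP:</th><td>My Provider</td>...', $matches);
var_dump($matches); // $matches[1] holds the captured ISP value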
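If the markup is more varied than a single row, a DOM query may be more robust than a regex. The following is a minimal sketch, assuming PHP's bundled DOM extension is available; the variable names ($html, $isp) and the shortened table are illustrative only, not part of the original question.

```php
<?php
// Shortened copy of the table from the question (illustrative input).
$html = '<table style="padding-top:10px;"><tbody>
<tr><th>ISP:</th><td>My Provider</td></tr>
<tr><th>Organization:</th><td nowrap=""></td></tr>
</tbody></table>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Find the <th> whose text is "ISP:" and take the first <td> that follows it.
$xpath = new DOMXPath($doc);
$isp = trim($xpath->evaluate(
    'string(//th[normalize-space()="ISP:"]/following-sibling::td[1])'
));

echo $isp, "\n";
```

The XPath string() wrapper makes evaluate() return plain text, so an absent row yields an empty string rather than an error.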
Related
I'm having difficulty finding the DYNAMIC-TEXT value in a sea of HTML tables.
I have tried $html->find("th[plaintext*=Type") and, from there, wanted to access the sibling, but it returns nothing. Here's the table structure:
<table>
<tbody>
</tbody>
<colgroup>
<col width="25%">
<col>
</colgroup>
<tbody>
<tr class="odd">
<th colspan="2">Name</th>
</tr>
<tr class="even">
<th width="30%">Type</th>
<td>DYNAMIC-TEXT</td>
</tr>
</tbody>
</table>
I expect the output to be the text DYNAMIC-TEXT, but the actual output is empty.
Thanks
In your code $html->find("th[plaintext*=Type") you use the attribute selector *=, but the th element has no attribute named plaintext (plaintext is a property of the parsed element, not an HTML attribute).
There is, however, a width attribute with the value 30%. You could match it with the pattern ^[0-9]+%$, which checks for one or more digits followed by a percent sign.
If that finds a result, you can call next_sibling() on it and read the plaintext of the sibling.
For example:
$html = str_get_html($str);
foreach ($html->find("th[width*=^[0-9]+%$]") as $value) {
echo $value->next_sibling()->plaintext;
}
Result:
DYNAMIC-TEXT
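If you would rather match on the header text than on the width attribute, one option is to compare the text of each th yourself. The sketch below swaps Simple HTML DOM for PHP's bundled DOM extension so that it runs without extra libraries; the shortened table and the variable names ($str, $found) are assumptions made for the example.

```php
<?php
// Shortened copy of the table from the question (illustrative input).
$str = '<table><colgroup><col width="25%"><col></colgroup><tbody>
<tr class="odd"><th colspan="2">Name</th></tr>
<tr class="even"><th width="30%">Type</th><td>DYNAMIC-TEXT</td></tr>
</tbody></table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($str);
libxml_clear_errors();

$found = null;
foreach ($doc->getElementsByTagName('th') as $th) {
    // Compare the visible text of the header cell, not an attribute.
    if (trim($th->textContent) !== 'Type') {
        continue;
    }
    // Skip whitespace text nodes until the next element sibling.
    $node = $th->nextSibling;
    while ($node !== null && $node->nodeType !== XML_ELEMENT_NODE) {
        $node = $node->nextSibling;
    }
    if ($node !== null && $node->nodeName === 'td') {
        $found = trim($node->textContent);
        break;
    }
}

echo $found, "\n";
```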
I want to extract values from the code below.
<tbody>
<tr>
<td><div class="file_pdf">note1</div></td>
<td class="textright">110 KB</td>
<td class="textright">106</td>
</tr>
<tr>
<td><div class="file_pdf">note2.pdf</div></td>
<td class="textright">44 KB</td>
<td class="textright">104</td>
</tr>
</tbody>
I want to extract the strings 'note1' and 'note2' and the numbers 1628 and 1629.
I tried
preg_match_all('~(\'\)\">(.*?)<\/a>)~', $getinside, $matches);
but the result is not what I am looking for.
Is there a simple regex to extract them? Thanks!
This should work for you:
preg_match_all("~downloadFile\('(\d+)'\)\">([^<]*)</a>~", $getinside, $matches);
Remember: if your HTML is very large or complex and you also need to parse other things from it, regex is not the best tool for the job.
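To see what the pattern captures, here is a small sketch. The anchors are not visible in the HTML quoted in the question, so the input below is an assumption: it guesses that each file name sits inside a link whose onclick calls downloadFile('<number>'), which is what the pattern expects. The variable names $ids and $names are illustrative.

```php
<?php
// ASSUMED input: the link markup is reconstructed from the pattern,
// not copied from the original page.
$getinside = <<<'HTML'
<td><div class="file_pdf"><a href="#" onclick="downloadFile('1628')">note1</a></div></td>
<td><div class="file_pdf"><a href="#" onclick="downloadFile('1629')">note2.pdf</a></div></td>
HTML;

preg_match_all("~downloadFile\('(\d+)'\)\">([^<]*)</a>~", $getinside, $matches);

$ids   = $matches[1]; // first capture group: the numeric ids
$names = $matches[2]; // second capture group: the link texts

print_r($ids);
print_r($names);
```

With the default PREG_PATTERN_ORDER, $matches[0] holds the full matches and each later index holds one capture group across all matches.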
I'm using file_get_contents to read a .html file that has a table.
<table id="someTable" style="width:100%;margin-bottom:0;">
<tr style="display:none;">
<td style="padding-left:25px;">Some text</td>
</tr>
<tr style="display:none;">
<td style="padding-left:25px;">another text</td>
</tr>
</table>
When I use preg_match_all to read the table, I get no matches when I count $matches[1]:
preg_match_all('/<table id="someTable" style="width:100%;margin-bottom:0;">(.*)<\/table>/', $html, $matches);
$co = count($matches[1]);
Add the s modifier (PCRE_DOTALL) to your pattern so that . also matches newlines:
preg_match_all('/<table id="someTable" style="width:100%;margin-bottom:0;">(.*)<\/table>/s', $html, $matches);
See http://ideone.com/3w0K2
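The difference is easy to check side by side. This is a minimal sketch using a shortened version of the table from the question; the variable names ($without, $with) are illustrative.

```php
<?php
// Multi-line input: the table content spans several lines.
$html = "<table id=\"someTable\" style=\"width:100%;margin-bottom:0;\">\n"
      . "<tr style=\"display:none;\">\n"
      . "<td style=\"padding-left:25px;\">Some text</td>\n"
      . "</tr>\n"
      . "</table>";

$pattern = '/<table id="someTable" style="width:100%;margin-bottom:0;">(.*)<\/table>/';

// Without s, "." stops at the first newline, so nothing matches.
$without = preg_match_all($pattern, $html, $m1);

// With s, "." also matches newlines and the whole table body is captured.
$with = preg_match_all($pattern . 's', $html, $m2);

echo "without s: $without, with s: $with\n";
```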
preg_match_all('|<tr>(.*?)</tr>|', '<table>
<tr>
<td>oo</td>
</tr>
<tr>
<td>ddd</td>
</tr>
</table>', $matches, PREG_PATTERN_ORDER);
Why doesn't this show any results?
I want to get the second match, $matches[1][1] (indexes start at 0).
You need to use the s pattern modifier so that . also matches the newlines inside each row:
preg_match_all('|<tr>(.*?)</tr>|s', ...
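As a runnable sketch of the corrected call, using the table from the question ($table and $second are illustrative names):

```php
<?php
$table = "<table>\n<tr>\n<td>oo</td>\n</tr>\n<tr>\n<td>ddd</td>\n</tr>\n</table>";

preg_match_all('|<tr>(.*?)</tr>|s', $table, $matches, PREG_PATTERN_ORDER);

// $matches[1] lists the first capture group for every match,
// so the second row is at index 1.
$second = trim($matches[1][1]);

echo $second, "\n";
```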
I have some HTML source code and need to extract some text from it. I cannot use DOM because the document isn't well-formed.
The source may change later without my being aware of it, so the solution needs to hold up in most situations.
I'm fetching the source with curl and plan to process it with preg_match_all and regular expressions.
Source:
...
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>: </TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>: </B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
...
...
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>: </TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD> </B></TD>
<TD width="40%"> </TD>
</TR>
...
As you can see, the source is not well-formed. In fact, it's terrible! But there is nothing I can do about that.
The actual source is longer than this.
How can I get the data out of the source? I could strip all the HTML tags, but then how would I know the order of the data? What can I do with preg_match_all and regex? What else can I do?
Thanks in advance for your help.
If you can use the DOM, it is far better than regexes. Take a look at PHP Tidy; it's designed to handle badly formed HTML.
You can use DOMDocument to load badly formed HTML:
$doc = new DOMDocument();
@$doc->loadHTML('<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>: </TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD> </B></TD>
<TD width="40%"> </TD>
</TR>');
$tds = @$doc->getElementsByTagName('td');
foreach ($tds as $td) {
echo $td->textContent, "\n";
}
I'm suppressing warnings in the above code for brevity.
Output:
Age
:
32
data
<!-- space -->
<!-- space -->
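To keep track of which value belongs to which label, the cells of each row can be grouped once they are extracted. The sketch below is one possible approach, not the original answer's method: it assumes every row follows the label, separator, value pattern seen in the question, wraps the row in a table element for the parser, and uses libxml_use_internal_errors() instead of the @ operator. The names $source and $data are illustrative.

```php
<?php
// One malformed row from the question, wrapped in <table> for the parser.
$source = '<table>
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>: </TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>: </B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
</table>';

// Collect parser warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

$data = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    // ASSUMPTION: cells come in groups of three: label, ":", value.
    foreach (array_chunk($cells, 3) as $chunk) {
        if (count($chunk) === 3 && $chunk[0] !== '' && $chunk[2] !== '') {
            $data[$chunk[0]] = $chunk[2];
        }
    }
}

print_r($data);
```

If the real page breaks the three-cell pattern, the grouping step would need to be adapted, but the cell order within each row is preserved either way.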
Using regex to parse HTML can be a futile effort as HTML is not a regular language.
Don't use RegEx. The link is funny but not informative, so the long and short of it is that HTML markup is not a regular language, hence cannot be parsed simply using regular expressions.
You could use RegEx to parse individual 'tokens' (a single open tag, a single attribute name or value, ...) as part of a recursive parsing algorithm, but you cannot use a magic RegEx to parse HTML all on its own.
Or you could use a parser.
Since the markup isn't valid, maybe you could use TagSoup or PHP:Tidy.
// The pattern needs delimiters (~ here); the i modifier makes the tag names case-insensitive.
$regex = <<<EOF
~<TR Class="Head2">\s+<TD width="15%" align="left">Age</B></TD>\s+<TD>: </TD>\s+<TD align="center"><font color="red">(\d+)</font></TD>\s+<TD width="15%"><font size="10">(\w+)</TD></font>\s+<TD> </B></TD>\s+<TD width="40%"> </TD>\s+</TR>~i
EOF;
preg_match_all($regex, $text, $result);
var_dump($result);
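A pattern that spells out the whole row breaks as soon as any attribute changes. A looser sketch is shown below; it rests on an assumption taken from the sample only, namely that every value of interest is wrapped in <font color="red">. The variable names $text and $values are illustrative, and the second row is shortened.

```php
<?php
// Sample rows from the question (second row shortened).
$text = <<<'HTML'
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>: </TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>: </B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>: </TD>
<TD align="center"><font color="red">32</font></TD>
</TR>
HTML;

// i: ignore tag case; s: let "." span newlines; (.*?) is non-greedy
// so each match stops at the nearest closing </font>.
preg_match_all('~<font\s+color="red">\s*(.*?)\s*</font>~is', $text, $result);

$values = $result[1]; // captured values in document order

print_r($values);
```

The values come back in document order, so they can be paired with labels by position if the layout stays stable.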