Using regexes to find result from HTML table

I am stuck on a regular expression problem.
I have a huge HTML file and I need to extract some text (the model number) from it.
<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>SK1998</b></td></tr>
.... and so on
This is a huge page built entirely out of tables, with no divs.
The class "thumimages" is repeated on almost every td, so it gives no way to tell the required content apart from the rest of the page.
There are about 10000 model numbers and I need to extract them.
Is there any way to do this with a regex, something like
"/<td colspan="2" align="center" class="thumimages"><b>{[1-9]}</b></td></tr>/"
that returns an array of all the matched results? Note that I have tried HTML parsing, but the document contains too many HTML validation errors.
Any help would be greatly appreciated.

Description
This will match all td fields with class="thumimages" and retrieve the contents of the inner b tag. The inner text needs to contain at least one non-space character, and any leading or trailing spaces will be removed.
<td\b(?=\s)(?=[^>]*\s\bclass=(["'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>
Groups
Group 0 gets the entire td tag from open tag to close tag
Group 1 gets the open quote around the class value to ensure the correct closing quote is also found
Group 2 gets the desired text
PHP Code Example:
Input text
<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr> 
<table>/.....
<td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td></tr>
Code
<?php
$sourcestring="your source string";
preg_match_all('/<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
Matches
$matches Array:
(
[0] => Array
(
[0] => <td colspan="2" align="center" class="thumimages"><b>SK10014</b></td>
[1] => <td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td>
)
[1] => Array
(
[0] => "
[1] => "
)
[2] => Array
(
[0] => SK10014
[1] => SK1998
)
)

Method with DOMDocument:
// $html stands for your html content
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses the warnings raised by invalid markup
$td_nodes = $doc->getElementsByTagName('td');
foreach($td_nodes as $td_node){
if ($td_node->getAttribute('class')=='thumimages')
echo $td_node->firstChild->textContent.'<br/>';
}
Method with regex:
$pattern = <<<'LOD'
~
<td (?>[^>c]++|\bc(?!lass\b))+ # beginning of td tag until the word "class"
class \s*+ = \s*+ # "class=" with variable spaces around the "="
(["']?+) thumimages\b \1 # "thumimages" between quotes or not
(?>[^>]++|(?<!b)>)+> # all characters until the ">" from "<b>"
\s*+ \K # any spaces and pattern reset
[^<\s]++ # all chars that are not a "<" or a space
~xi
LOD;
preg_match_all($pattern, $html, $matches);
echo '<pre>' . print_r($matches[0], true);

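/(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(<\/b><\/td><\/tr>)/i
This works, provided the slashes in the closing tags are escaped (as above) because / is also the pattern delimiter.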

You can use php DOMDocument Class
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('load.html');
$xpath = new DOMXPath($dom);
foreach($xpath->query('//tr') as $tr){
echo $xpath->query('.//td[@class="thumimages"]', $tr)->item(0)->nodeValue.'<br/>';
}
?>
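As a minimal sketch of the same DOM approach on markup like the question's: the sample input below is hypothetical (modeled on the question, including the stray </tr> tags), and libxml_use_internal_errors() is used here instead of @ to keep the parser warnings quiet. It queries the b elements directly, so rows without a matching td are never dereferenced.

```php
<?php
// Hypothetical sample input modeled on the question's markup (note the stray </tr> tags).
$html = <<<'HTML'
<table>
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr>
<td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td></tr>
</table>
HTML;

// Collect parser warnings internally instead of printing them; the markup is known to be invalid.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$models = [];
// Select every <b> that is a direct child of a td whose class is exactly "thumimages".
foreach ($xpath->query('//td[@class="thumimages"]/b') as $b) {
    $text = trim($b->textContent);
    if ($text !== '') { // skip cells that hold only whitespace
        $models[] = $text;
    }
}
print_r($models);
```

Whether this copes with the real 10000-row page depends on how badly broken its markup is, so treat it as a starting point rather than a guaranteed fix.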

Related

Regular expression - PHP Preg Match

I am learning to use Regular expressions and would like to grab some data from a table:
The file looks like this:
$subject =
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
</tbody>
Currently I am doing the following:
$pattern = "/<tr>.*?<td><\/td>.*?<td>(.*?)<\/td>.../s";
preg_match(
$pattern,
$subject,
$result);
This will output an array:
$result = [
0 => "tbody>...",
1 => 1,
2 => 2,
3 => 3,
4 => 4 ... n
]
This seems inefficient so I am attempting to grab a repeated pattern like so:
$pattern = "/<td>([0-9]{1,2})<\/td>/s";
This however only grabs the first number: 1
What would be the best way to go about this?

You should use preg_match_all instead of preg_match so that the whole string is searched for every match, not just the first one
http://php.net/manual/en/function.preg-match-all.php
if (preg_match_all( $pattern, $subject, $matches)) {
var_dump($matches);
}
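To make the difference concrete, here is a small sketch that runs the asker's second pattern through both functions. The one-line $subject is a condensed, hypothetical version of the table in the question.

```php
<?php
// Condensed, hypothetical version of the table from the question.
$subject = '<tr><td>1</td><td>2</td><td>3</td></tr><tr><td>4</td><td>5</td><td>6</td></tr>';
$pattern = '/<td>([0-9]{1,2})<\/td>/';

// preg_match stops at the first match: $first[1] holds a single value.
preg_match($pattern, $subject, $first);

// preg_match_all keeps scanning: $all[1] holds every captured value,
// and the return value is the number of full matches found.
$count = preg_match_all($pattern, $subject, $all);

echo $first[1], "\n";
print_r($all[1]);
echo $count, "\n";
```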

Here's a way to accomplish this using a parser:
$subject = '
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
</tbody>';
$html = new DOMDocument();
$html->loadHTML($subject);
$tds = $html->getElementsByTagName('td');
foreach($tds as $td){
echo $td->nodeValue . "\n";
if(is_numeric($td->nodeValue)) {
echo "it's a number \n";
}
}
Output:
1
it's a number
2
it's a number
3
it's a number
4
it's a number
5
it's a number
6
it's a number

To get all the values instead of stopping after the first match, you need the equivalent of the g (global) flag found in other regex flavors.
In PHP this is implemented by the preg_match_all function.
Since the data will always be contained in a td, you can do the following:
preg_match_all("/<td>(.*?)<\/td>/", $subject, $matches);
var_dump($matches);
Where $subject contains your HTML; you should see an array of all your table data.

My regular expression doesn't know when to stop

I'm trying to match this (the name in particular):
<tr>
<th class="name">Name:</th>
<td>John Smith</td>
</tr>
Like this:
preg_match('/<th class="name">Name:<\/th>.+?<td>(.+)<\/td>/s', $a, $b);
However, while it matches the name, it doesn't stop at the end of the name. It keeps going for another 150 or so characters. Why is this? I only want to match the name.

Make last quantifier non-greedy: preg_match('/<th class="name">Name:<\/th>.+?<td>(.+?)<\/td>/s', $a, $b);
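A short sketch of why the greedy version overshoots. The second row in $a is a hypothetical addition so that there is a later </td> for the greedy quantifier to run on to.

```php
<?php
// Sample input: the row from the question plus a hypothetical second row.
$a = <<<'HTML'
<tr>
<th class="name">Name:</th>
<td>John Smith</td>
</tr>
<tr>
<th class="name">Age:</th>
<td>42</td>
</tr>
HTML;

// Greedy (.+): consumes as much as possible, so the capture runs to the LAST </td>.
preg_match('/<th class="name">Name:<\/th>.+?<td>(.+)<\/td>/s', $a, $greedy);

// Lazy (.+?): consumes as little as possible, so the capture stops at the FIRST </td>.
preg_match('/<th class="name">Name:<\/th>.+?<td>(.+?)<\/td>/s', $a, $lazy);

echo strlen($greedy[1]), " characters captured by the greedy pattern\n";
echo $lazy[1], "\n";
```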

Don't use regex to parse HTML; it's very easy with DOMDocument:
<?php
$html = <<<HTML
<tr>
<th class="name">Name:</th>
<td>John Smith</td>
</tr>
<tr>
<th class="name">Somthing:</th>
<td>Foobar</td>
</tr>
HTML;
$dom = new DOMDocument();
@$dom->loadHTML($html);
$ret = array();
foreach($dom->getElementsByTagName('tr') as $tr) {
$ret[trim($tr->getElementsByTagName('th')->item(0)->nodeValue,':')] = $tr->getElementsByTagName('td')->item(0)->nodeValue;
}
print_r($ret);
/*
Array
(
[Name] => John Smith
[Somthing] => Foobar
)
*/
?>

preg_match('/<th class="name">Name:<\/th>\s*<td>(.+?)<\/td>/s', $line, $matches);
Match only whitespace between the </th> and <td>, and non-greedy match for the name.

preg_match('/<th class="name">Name:<\/th>.+?<td>(?P<name>.*)<\/td>/s', $str, $match);
echo $match['name'];

Here is your match:
preg_match('!<tr>\s*<th[^>]*>Name:</th>\s*<td>([^<]*)</td>\s*</tr>!s', $a, $b);
It should work for the markup you posted; the name ends up in $b[1].

Unusual behaviour of regex

My Setup:
index.php:
<?php
$page = file_get_contents('a.html');
$arr = array();
preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);
print_r($arr);
?>
a.html:
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
Output:
Array
(
[0] => Array
(
)
)
If I change the line 4 of index.php to:
preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);
The output is:
Array
(
[0] => <td class="myclass">
THE
CONTENT
</t
[1] =>
THE
CONTENT
)
I can't make out what's wrong. Please help me match the content between <td class="myclass"> and </td>.

Your code appears to work. I edited the regex to use a different separator and get a clearer view. You may want to use the ungreedy modifier in case there is more than one myclass TD in your HTML.
I have not been able to reproduce the "array of array" behaviour you note, unless I manipulate the code to add an error -- see at bottom.
<?php
$page = <<<PAGE
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
PAGE;
preg_match('#<td class="myclass">(.*)</td>#s',$page,$arr);
print_r($arr);
?>
returns, as expected:
Array
(
[0] => <td class="myclass">
THE
CONTENT
</td>
[1] =>
THE
CONTENT
)
The code below is similar to yours but has been modified to cause an identical error. Doesn't seem likely you did this, though. The regexp is modified in order to not match, and the resulting empty array is stored into $arr[0] instead of $arr.
preg_match('#<td class="myclass">(.*)</ td>#s',$page,$arr[0]);
Returns the same error you observe:
Array
(
[0] => Array
(
)
)
I can duplicate the same behaviour you observe (works with </t, does not work with </td>) if I use your regexp, but modify the HTML to have </t d>. I still need to write to $arr[0] instead of $arr if I also want to get an identical output.

Do you understand that the third parameter of preg_match receives the matches? Its first element contains the full match, and the following elements contain the captured subpatterns.
http://ca3.php.net/manual/en/function.preg-match.php
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
This code
preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);
When applied on
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
Will return the match in $arr[0] and the result of (.*) in $arr[1]. This result is correct: There is your content in [1]
Array
(
[0] => <td class="myclass">
THE
CONTENT
</t
[1] =>
THE
CONTENT
)
Example two
<?php
header('Content-Type: text/plain');
$page = 'A B C D E F';
$arr = array();
preg_match('/C (D) E/', $page, $arr);
print_r($arr);
Example output
Array
(
[0] => C D E // This is the string found
[1] => D // this is what I wanted to look for and extracted out of [0], the matched parenthesis
)

Your regex seems correct. Isn't the syntax of preg_match as follows?
preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);
The | in the regex represents or

Extract content from each first TD in a Table

I've got some HTML that looks like this:
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-even">
<td align="center">ijkl</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
And I need to retrieve the values, abcde, efgh, and ijkl
This is the regex I'm currently using:
preg_match_all('/(<tr class="row-even">|<tr class="row-odd">)<td align="center">(.*)<\/td><\/tr>/xs', $html, $matches);
Yes, I'm not very good at them. As with most of my regex attempts, this is not working. Can anyone tell me why?
Also, I know about html/xml parsers, but it would require a significant code revisit to make that happen. So that's for later. We need to stick with regex for now.
EDIT: To clarify, I need the values between the first <td align="center"></td> tag after either <tr class="row-even"> or <tr class="row-odd">

~<tr class="row-(even|odd)">\s*<td align="center">(.*?)</td>~m
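Notice the use of \s*, which absorbs the newline and any indentation between the tags. (The m modifier only changes how ^ and $ behave, so it has no effect on this particular pattern and can be dropped.)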
Also, you can make the first group non-capturing via ?:. I.e., (?:even|odd) as you're probably not interested in the class attribute :)
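Put together with preg_match_all, the pattern with the non-capturing group could look like the sketch below. The $html value is copied from the question; the m modifier is left out here because the pattern contains no ^ or $ anchors.

```php
<?php
// Input copied from the question.
$html = <<<'HTML'
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-even">
<td align="center">ijkl</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
HTML;

// (?:even|odd) groups the alternatives without capturing them,
// so the cell text is the only capture and lands in $matches[1].
preg_match_all('~<tr class="row-(?:even|odd)">\s*<td align="center">(.*?)</td>~', $html, $matches);
print_r($matches[1]);
```

Only the first td of each row is matched, because the pattern requires the td to follow the opening tr directly (apart from whitespace).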

Try this:
preg_match_all('/(?:<tr class="row-even">|<tr class="row-odd">).<td align="center">(.*?)<\/td>/s', $html, $matches);
Changes made:
You hadn't accounted for the newline between the tags.
You don't need the x modifier, as it makes the regex ignore the literal spaces in the pattern.
Make the matching non-greedy by using .*? in place of .*.

Actually, you don't need a big change in your codebase. Fetching text nodes is always the same with DOM and XPath. All that changes is the XPath query, so you could wrap the DOM code in a function that replaces your preg_match_all. That would be just a tiny change, e.g.
include_once "dom.php";
$matches = dom_match_all('//tr/td[1]', $html);
where dom.php just contains:
// dom.php
function dom_match_all($query, $html, array $matches = array()) {
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTML($html);
libxml_clear_errors();
$xPath = new DOMXPath($dom);
foreach( $xPath->query($query) as $node ) {
$matches[] = $node->nodeValue;
}
return $matches;
}
and would return
Array
(
[0] => abcde
[1] => efgh
[2] => ijkl
)
But if you want a Regex, use a Regex. I am just giving ideas.

This is what I came up with
<td align="center">([^<]+)</td>
I'll explain. One of the challenges here is that what's between the td tags could be either the text you're looking for or an <img> tag. In the regex, [^<]+ says to match one or more characters that are not the < character. That's great, because it means the <img> cell won't match, and the group will only match until the closing </td> tag is found.

Disclaimer: Using regexps to parse HTML is dangerous.
To get the innerhtml of the first TD in each TR, use this regexp:
/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si

This is just a quick and dirty regex to meet your needs. It could easily be cleaned up and optimized, but it's a start.
<tr[^>]+>[^\n]*\n #Match the opening <tr> tag
\s*<td[^>]+>([^<]+)[^\n]+\n #Group the wanted data
[^\n]+\n #Match next line
</tr> #Match closing tag
Here is an alternative way, which may be more robust:
deluserconfirm.html\?user=([^"]+)

explode glitch in delimiter

I'm having some trouble with the delimiter for explode. I have a rather chunky string as a delimiter, and it seems to break down when I add another letter (the start of a word), but it doesn't get fixed when I remove the first letter, which would indicate it isn't about length.
To wit, the (working) code is:
$boom = htmlspecialchars("<td width=25 align=\"center\" ");
$arr[1] = explode($boom, $arr[1]);
The full string I'd like to use is <td width=25 align=\"center\" class=\", and when I start adding class, explode breaks down and nothing gets split. That happens as soon as I add c, and it doesn't go away if I remove <, which it would if it were just a matter of string length.
Basically, the problem isn't dire, since I can just replace class=" with "" after the explode and get the same result, but this has given me headaches to diagnose, and it seems like a really weird problem. For what it's worth, I'm using PHP 5.3.0 in XAMPP 1.7.2.
Thanks in advance!

You could try converting every occurrence of the delimiter in the original string
"<td width=25 align=\"center\" "
in something more manageable like:
"banana"
and then explode on that word

Have you tried applying htmlspecialchars to the string passed to explode as well?
$arr[1] = explode($boom, htmlspecialchars($arr[1]));
I get unexpected results without it, but with it it works perfectly.
$s = '<td width=25 align="center" class="asdjasd">sdadasd</td><td width=25 align="center" >asdasD</td>';
$boom = htmlspecialchars("<td width=25 align=\"center\" class=");
$sex = explode($boom, $s);
print_r($sex);
Outputs:
Array
(
[0] => <td width=25 align="center" class="asdjasd">sdadasd</td><td width=25 align="center" >asdasD</td>
)
Whereas
$s = '<td width=25 align="center" class="asdjasd">sdadasd</td><td width=25 align="center" >asdasD</td>';
$boom = htmlspecialchars("<td width=25 align=\"center\" class=");
$sex = explode($boom, htmlspecialchars($s));
print_r($sex);
Outputs
Array
(
[0] =>
[1] => "asdjasd">sdadasd</td><td width=25 align="center" >asdasD</td>
)
This is because $boom is htmlspecialchars-encoded: < and > get transformed into &lt; and &gt;, which explode cannot find in the unencoded string, so it just returns the whole string.
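The mismatch is easy to see by printing the encoded delimiter. This is a minimal sketch with a shortened, hypothetical input string; the variable names are illustrative only.

```php
<?php
// Shortened, hypothetical input string.
$s = '<td width=25 align="center" class="a">x</td>';

// htmlspecialchars turns <, > and " into entities, so the delimiter
// no longer looks like the raw markup in $s.
$delimiter = htmlspecialchars('<td width=25 align="center" class=');
echo $delimiter, "\n";

// Raw string, encoded delimiter: no occurrence, so explode returns one element.
$raw = explode($delimiter, $s);

// Both sides encoded: the delimiter is found and the string is split in two.
$encoded = explode($delimiter, htmlspecialchars($s));

echo count($raw), "\n";
echo count($encoded), "\n";
```

The other way round also works: leave both the delimiter and the input unencoded and call explode on the raw strings.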
