I'm having the following problem. I have this structure:
$table = '
<table>
<tbody>
<tr valign="top">
<td>foo</td>
<td>bar</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr valign="top">
<td>bee</td>
<td>dog</td>
</tr>
</tbody>
</table>';
I'm trying to retrieve an array with the contents of every <tr>, but without success. The closest pattern I could come up with returns everything jumbled together.
$pattern = "/<tr valign[^>]*>(.*)<\/tr>/s";
preg_match_all($pattern, $table, $matches, PREG_PATTERN_ORDER);
If I run var_dump($matches), I want to get an array like:
array(
[0] => "<td>foo</td><td>bar</td>",
[1] => "<td>bee</td><td>dog</td>"
);
...or something close to that.
But I receive:
string(301) "
foo
bar
"
<table>
<tbody>
<tr valign="top">
<td>bee</td>
<td>dog</td>
</tr>
</tbody></table>
Anyone know what I'm doing wrong?
Thanks in advance.
You must make your quantifier lazy: .* => .*?
A greedy quantifier (.*) takes as many characters as possible; a lazy quantifier (.*?) takes as few as possible.
With a lazy quantifier, the regex engine takes characters one at a time and tests after each one whether the rest of the pattern can match.
With a greedy quantifier (the default behavior), the regex engine first takes all possible characters (up to the end of the string in your case) and then backtracks character by character until the rest of the pattern matches.
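The difference can be seen with a small sketch. The input below is a condensed copy of the question's two tables (whitespace removed for brevity), and the variable names $greedy and $lazy are only illustrative:

```php
<?php
// Condensed version of the question's input: two tables, one row each.
$table = '<table><tbody><tr valign="top"><td>foo</td><td>bar</td></tr></tbody></table>'
       . '<table><tbody><tr valign="top"><td>bee</td><td>dog</td></tr></tbody></table>';

// Greedy: (.*) runs to the end of the string, then backtracks to the LAST </tr>.
preg_match_all('/<tr valign[^>]*>(.*)<\/tr>/s', $table, $greedy);

// Lazy: (.*?) grows one character at a time and stops at the FIRST </tr>.
preg_match_all('/<tr valign[^>]*>(.*?)<\/tr>/s', $table, $lazy);

echo count($greedy[1]), "\n"; // one match that spans both tables
echo count($lazy[1]), "\n";   // one match per row
print_r($lazy[1]);
```

With the lazy version, $lazy[1] holds one entry per row, which is the shape the question asks for.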
Notes:
There is no need to pass PREG_PATTERN_ORDER, since it is the default flag of preg_match_all.
DOMDocument is probably a better-suited tool for dealing with HTML. Example:
$dom = new DOMDocument();
@$dom->loadHTML($table); // @ suppresses warnings about malformed HTML
$trs = $dom->getElementsByTagName('tr');
$results = array();
foreach ($trs as $tr) {
if ($tr->hasAttribute('valign')) {
$children = $tr->childNodes;
$tmp = '';
foreach ($children as $child) {
$tmp .= trim($dom->saveHTML($child));
}
if (!empty($tmp)) $results[] = $tmp;
}
}
echo htmlspecialchars(print_r($results, true));
How can I scrape the "Lamp In Box" data from the following string using a regexp?
I want to extract "1 Unit(s)" using a regexp.
I have written the regexp below, but it's not working.
Regexp - Lamp In Box: '(.*)(s)<\/td>
`<td><b>Price:</b></td> <td>
<br />Free Ground Shipping <span class="show_free_shipping" style="color:red;">[?]</span>
<br />Ship From United States
</td>
</tr>
<tr>
<td><b>Availability:</b></td>
<td>
<b style="color:blue;">In Stock</b>
</td>
</tr>
<tr>
<td><b>Lamp In Box:</b></td>
<td>1 Unit(s)</td>
</tr>
</table>
`
To capture the number for Lamp In Box, you can try the following:
$string = <<your input string>>;
$pattern = '/Lamp In Box:.*\s*.*?(\d+) Unit\(s\)/i';
preg_match($pattern, $string, $result);
$result[1] will then contain the number you need.
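As a rough check, here is that pattern run against a shortened copy of the question's markup. The shortened $string is an assumption made for illustration; the real input is the full table from the question:

```php
<?php
// Shortened stand-in for the question's HTML (only the relevant row).
$string = "<tr>\n<td><b>Lamp In Box:</b></td>\n<td>1 Unit(s)</td>\n</tr>";

$pattern = '/Lamp In Box:.*\s*.*?(\d+) Unit\(s\)/i';

// preg_match returns 1 on a match; group 1 holds the digits before " Unit(s)".
$found = preg_match($pattern, $string, $result);

if ($found === 1) {
    echo $result[1], "\n";
}
```

Index 0 of $result holds the whole matched text; the captured number is at index 1.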
If I understand you correctly, you can try the following regular expression: /(<td>).*Lamp In Box:.*\s*.*?(<\/td>)/i
You can test it like this:
$string = <<your input string>>;
$pattern = '/(<td>).*Lamp In Box:.*\s*.*?(<\/td>)/i';
$replacement = '$1$2';
echo preg_replace($pattern, $replacement, $string);
This replaces the <td> that contains Lamp In Box, together with the cell that follows it, with an empty <td></td>.
I am learning to use Regular expressions and would like to grab some data from a table:
The file looks like this:
$subject =
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
</tbody>
Currently I am doing the following:
$pattern = "/<tr>.*?<td><\/td>.*?<td>(.*?)<\/td>.../s";
preg_match(
$pattern,
$subject,
$result);
This will output an array:
$result = [
0 => "tbody>...",
1 => 1,
2 => 2,
3 => 3,
4 => 4 ... n
]
This seems inefficient so I am attempting to grab a repeated pattern like so:
$pattern = "/<td>([0-9]{1,2})<\/td>/s";
This however only grabs the first number: 1
What would be the best way to go about this?
You should use preg_match_all instead of preg_match to search the entire string:
http://php.net/manual/en/function.preg-match-all.php
if (preg_match_all( $pattern, $subject, $matches)) {
var_dump($matches);
}
Here's a way to accomplish this using a parser:
$subject = '
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
</tbody>';
$html = new DOMDocument();
$html->loadHTML($subject);
$tds = $html->getElementsByTagName('td');
foreach($tds as $td){
echo $td->nodeValue . "\n";
if(is_numeric($td->nodeValue)) {
echo "it's a number \n";
}
}
Output:
1
it's a number
2
it's a number
3
it's a number
4
it's a number
5
it's a number
6
it's a number
To get all the values instead of stopping after the first match, other regex flavors use the g flag.
PHP has no g flag; global matching is done by the preg_match_all function instead.
Since the data will always be contained in a td, you can do the following:
preg_match_all("/<td>(.*?)<\/td>/", $subject, $matches);
var_dump($matches);
Here $subject contains your HTML, and you should see an array of all your table data.
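To make the contrast concrete, here is a small sketch that runs the question's own pattern through both functions. The $subject string is a compact copy of the question's table, and the names $first and $all are only illustrative:

```php
<?php
// Compact copy of the table from the question.
$subject = "<tbody>\n<tr>\n<td>1</td>\n<td>2</td>\n<td>3</td>\n</tr>\n"
         . "<tr>\n<td>4</td>\n<td>5</td>\n<td>6</td>\n</tr>\n</tbody>";

$pattern = '/<td>([0-9]{1,2})<\/td>/';

// preg_match stops after the first match.
$foundOne = preg_match($pattern, $subject, $first);

// preg_match_all keeps scanning and returns the number of full matches.
$foundAll = preg_match_all($pattern, $subject, $all);

echo $first[1], "\n";             // first cell only
echo implode(',', $all[1]), "\n"; // every cell, in document order
```

With the default PREG_PATTERN_ORDER, $all[0] holds the full matches and $all[1] holds the captured values.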
I'm trying to match this (the name in particular):
<tr>
<th class="name">Name:</th>
<td>John Smith</td>
</tr>
Like this:
preg_match('/<th class="name">Name:<\/th>.+?<td>(.+)<\/td>/s', $a, $b);
However, while it matches the name, it doesn't stop at the end of the name. It keeps going for another 150 or so characters. Why is this? I only want to match the name.
Make last quantifier non-greedy: preg_match('/<th class="name">Name:<\/th>.+?<td>(.+?)<\/td>/s', $a, $b);
Don't use regex to parse HTML. It's very easy with DOMDocument:
<?php
$html = <<<HTML
<tr>
<th class="name">Name:</th>
<td>John Smith</td>
</tr>
<tr>
<th class="name">Somthing:</th>
<td>Foobar</td>
</tr>
HTML;
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed HTML
$ret = array();
foreach($dom->getElementsByTagName('tr') as $tr) {
$ret[trim($tr->getElementsByTagName('th')->item(0)->nodeValue,':')] = $tr->getElementsByTagName('td')->item(0)->nodeValue;
}
print_r($ret);
/*
Array
(
[Name] => John Smith
[Somthing] => Foobar
)
*/
?>
preg_match('/<th class="name">Name:<\/th>\s*<td>(.+?)<\/td>/s', $line, $matches);
Match only whitespace between the </th> and <td>, and non-greedy match for the name.
preg_match('/<th class="name">Name:<\/th>.+?<td>(?P<name>.*?)<\/td>/s', $str, $match);
echo $match['name'];
Here is a pattern that matches:
preg_match('!<tr>\s*<th[^>]*>Name:</th>\s*<td>([^<]*)</td>\s*</tr>!s', $a, $b);
Because [^<]* cannot run past the next tag, the capture stops at the end of the name.
Is it possible to put conditional logic inside an EOD string?
$str = <<<EOD
<table>
<tr>
<td>
if ( !empty($var1) ) {
{$var1}
} else {
{$var2}
}
</td>
</tr>
</table>
This doesn't work for me, and it sort of looks like it wouldn't work, but I thought I'd take a stab.
Also, is it EOD or EOT? Both seem to work.
No. You cannot use conditionals in heredoc.
Also, is it EOD or EOT?
As long as your beginning and ending strings match you can use anything:
$x = <<<THOMAS
Pick a string, any string
THOMAS;
The doc contains several examples demonstrating this
As to how best to achieve the example you provided, this would be my first inclination:
$td = !empty($var1) ? $var1 : $var2;
$str = <<<EOD
<table>
<tr>
<td>
{$td}
</td>
</tr>
</table>
EOD;
I've got some HTML that looks like this:
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-even">
<td align="center">ijkl</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
And I need to retrieve the values, abcde, efgh, and ijkl
This is the regex I'm currently using:
preg_match_all('/(<tr class="row-even">|<tr class="row-odd">)<td align="center">(.*)<\/td><\/tr>/xs', $html, $matches);
Yes, I'm not very good at them. As with most of my regex attempts, this is not working. Can anyone tell me why?
Also, I know about html/xml parsers, but it would require a significant code revisit to make that happen. So that's for later. We need to stick with regex for now.
EDIT: To clarify, I need the values between the first <td align="center"></td> tag after either <tr class="row-even"> or <tr class="row-odd">
~<tr class="row-(even|odd)">\s*<td align="center">(.*?)</td>~m
Notice the use of \s* to skip the whitespace between the tags. (The m modifier only changes how ^ and $ behave, so it is optional here.)
Also, you can make the first group non-capturing via ?:. I.e., (?:even|odd) as you're probably not interested in the class attribute :)
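As a sketch of the non-capturing variant, here it is applied to two of the question's rows. The indentation inside the nowdoc is an assumption about how the real markup is laid out; \s* absorbs it either way:

```php
<?php
// Two rows taken from the question; a nowdoc avoids any variable interpolation.
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
HTML;

// (?:even|odd) groups the alternatives without creating a capture,
// so the cell text is the only captured group (index 1).
preg_match_all(
    '~<tr class="row-(?:even|odd)">\s*<td align="center">(.*?)</td>~',
    $html,
    $matches
);

print_r($matches[1]);
```

Only the first cell of each row is matched, because the pattern requires the <td> to follow the opening <tr> directly.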
Try this:
preg_match_all('/(?:<tr class="row-even">|<tr class="row-odd">).<td align="center">(.*?)<\/td>/s', $html, $matches);
Changes made:
You had not accounted for the newline between the tags.
You don't need the x modifier, as it makes the regex ignore the literal spaces in your pattern.
Make the matching non-greedy by using .*? in place of .*.
Actually, you don't need much of a change in your codebase. Fetching text nodes always works the same way with DOM and XPath; all that changes is the XPath query, so you could wrap the DOM code in a function that replaces your preg_match_all. That would be just a tiny change, e.g.
include_once "dom.php";
$matches = dom_match_all('//tr/td[1]', $html);
where dom.php just contains:
// dom.php
function dom_match_all($query, $html, array $matches = array()) {
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTML($html);
libxml_clear_errors();
$xPath = new DOMXPath($dom);
foreach( $xPath->query($query) as $node ) {
$matches[] = $node->nodeValue;
}
return $matches;
}
and would return
Array
(
[0] => abcde
[1] => efgh
[2] => ijkl
)
But if you want a Regex, use a Regex. I am just giving ideas.
This is what I came up with
<td align="center">([^<]+)</td>
I'll explain. One of the challenges here is that what's between the <td> tags could be either the text you're looking for or an <img> tag. In the regex, [^<]+ says to match one or more characters that are not the < character. That's great, because it means the <img> cell won't match, and the group will only match up to the point where the </td> tag is found.
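A quick sketch of that behavior, using one row from the question written on a single line (the single-line layout is only for brevity):

```php
<?php
// One row from the question: a text cell followed by an image cell.
$html = '<tr class="row-even">'
      . '<td align="center">abcde</td>'
      . '<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>'
      . '</tr>';

// [^<]+ needs at least one character that is not "<", so a cell whose
// content starts with a tag (the <img>) cannot match.
preg_match_all('~<td align="center">([^<]+)</td>~', $html, $matches);

print_r($matches[1]);
```

The image cell is skipped without any lookahead, because its content begins with < and [^<]+ requires at least one other character.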
Disclaimer: Using regexps to parse HTML is dangerous.
To get the innerhtml of the first TD in each TR, use this regexp:
/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si
This is just a quick and dirty regex to meet your needs. It could easily be cleaned up and optimized, but it's a start.
<tr[^>]+>[^\n]*\n #Match the opening <tr> tag
\s*<td[^>]+>([^<]+)[^\n]+\n #Group the wanted data
[^\n]+\n #Match next line
</tr> #Match closing tag
Here is an alternative way, which may be more robust:
deluserconfirm.html\?user=([^"]+)