Regular expression - PHP Preg Match - php

I am learning to use Regular expressions and would like to grab some data from a table:
The file looks like this:
$subject =
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
</tbody>
Currently I am doing the following:
$pattern = "/<tr>.*?<td><\/td>.*?<td>(.*?)<\/td>.../s";
preg_match(
$pattern,
$subject,
$result);
This will output an array:
$result = [
0 => "tbody>...",
1 => 1,
2 => 2,
3 => 3,
4 => 4 ... n
]
This seems inefficient so I am attempting to grab a repeated pattern like so:
$pattern = "/<td>([0-9]{1,2})<\/td>/s";
This however only grabs the first number: 1
What would be the best way to go about this?

You should use preg_match_all instead of preg_match to search the entire string for every match:
http://php.net/manual/en/function.preg-match-all.php
if (preg_match_all( $pattern, $subject, $matches)) {
var_dump($matches);
}
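As a rough sketch of how this fits the question's data (the variable name $numbers is only illustrative, and the pattern is the asker's second attempt without the unneeded s flag), the captured values end up in $matches[1]:

```php
<?php
$subject = '
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
</tbody>';

// Pattern from the question; no s flag is needed because it contains no dot.
$pattern = '/<td>([0-9]{1,2})<\/td>/';

$numbers = [];
if (preg_match_all($pattern, $subject, $matches)) {
    // $matches[0] holds the full matches, $matches[1] the first capture group.
    $numbers = $matches[1];
}

print_r($numbers);
```

Note that the captured values are strings; cast them with (int) if you need integers.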

Here's a way to accomplish this using a parser:
$subject = '
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
</tbody>';
$html = new DOMDocument();
$html->loadHTML($subject);
$tds = $html->getElementsByTagName('td');
foreach($tds as $td){
echo $td->nodeValue . "\n";
if(is_numeric($td->nodeValue)) {
echo "it's a number \n";
}
}
Output:
1
it's a number
2
it's a number
3
it's a number
4
it's a number
5
it's a number
6
it's a number

To get all the values instead of stopping after the first match, many regex flavors use the g (global) flag.
PHP has no such flag; the equivalent is the preg_match_all function.
Since the data will always be contained in a td, you can do the following:
preg_match_all("/<td>(.*?)<\/td>/", $subject, $matches);
var_dump($matches);
Here $subject contains your HTML, and you should see an array of all your table data.

Related

How to scrape Data `Lamp In Box` in this string using `RegExp` in following string

I want to scrape the value 1 Unit(s) from the Lamp In Box row of the following string using a regexp.
I have written the regexp below, but it is not working.
Regexp - Lamp In Box: '(.*)(s)<\/td>
`<td><b>Price:</b></td> <td>
<br />Free Ground Shipping <span class="show_free_shipping" style="color:red;">[?]</span>
<br />Ship From United States
</td>
</tr>
<tr>
<td><b>Availability:</b></td>
<td>
<b style="color:blue;">In Stock</b>
</td>
</tr>
<tr>
<td><b>Lamp In Box:</b></td>
<td>1 Unit(s)</td>
</tr>
</table>
`
To capture the Lamp In Box quantity, you can try the following:
$string = <<your input string>>;
$pattern = '/Lamp In Box:.*\s*.*?(\d+) Unit\(s\)/i';
preg_match($pattern, $string, $result);
The captured number will be in $result[1].
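A minimal runnable sketch of that approach, assuming a shortened copy of the question's HTML as input (the $units variable is only illustrative):

```php
<?php
// Shortened copy of the HTML fragment from the question.
$string = '<tr>
<td><b>Lamp In Box:</b></td>
<td>1 Unit(s)</td>
</tr>';

$pattern = '/Lamp In Box:.*\s*.*?(\d+) Unit\(s\)/i';

$units = null;
if (preg_match($pattern, $string, $result)) {
    // $result[0] is the whole match, $result[1] the digits in front of " Unit(s)".
    $units = (int) $result[1];
}

var_dump($units);
```

Checking the return value of preg_match first avoids reading $result[1] when the row is missing.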
If I understand you correctly, you can try the following regular expression: /(<td>).*Lamp In Box:.*\s*.*?(<\/td>)/i
You can test it like this:
$string = <<your input string>>;
$pattern = '/(<td>).*Lamp In Box:.*\s*.*?(<\/td>)/i';
$replacement = '$1$2';
echo preg_replace($pattern, $replacement, $string);
This replaces everything inside the <td></td> span that contains Lamp In Box, leaving an empty <td></td>.

Get all matches in a preg_match request

I have the following problem. I have this structure:
$table = '
<table>
<tbody>
<tr valign="top">
<td>foo</td>
<td>bar</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr valign="top">
<td>bee</td>
<td>dog</td>
</tr>
</tbody>
</table>';
I'm trying to retrieve an array with all <tr> elements, but without success. The closest pattern I could come up with returns a messed-up result.
$pattern = "/<tr valign[^>]*>(.*)<\/tr>/s";
preg_match_all($pattern, $table, $matches, PREG_PATTERN_ORDER);
If I call var_dump($matches), I want an array like this:
array(
[0] => "<td>foo</td><td>bar</td>",
[1] => "<td>bee</td><td>dog</td>"
);
...or something close to that.
But I receive:
string(301) "
foo
bar
"
<table>
<tbody>
<tr valign="top">
<td>bee</td>
<td>dog</td>
</tr>
</tbody></table>
Anyone know what I'm doing wrong?
Thanks in advance.
You must make your quantifier lazy: change .* to .*?
A greedy quantifier (the default) makes .* take as many characters as possible (here, everything up to the end of the string); the regex engine then backtracks character by character until the rest of the pattern matches, which lands on the last </tr>.
A lazy quantifier makes .*? take as few characters as possible: the engine adds characters one at a time and tests after each one whether the rest of the pattern can match, so it stops at the first </tr>.
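The difference can be seen with a small sketch, assuming a condensed version of the question's $table (the variable names $greedy, $lazy, $greedyMatches and $lazyMatches are only illustrative):

```php
<?php
// Condensed version of the two tables from the question.
$table = '<table><tr valign="top"><td>foo</td><td>bar</td></tr></table>
<table><tr valign="top"><td>bee</td><td>dog</td></tr></table>';

$greedy = '/<tr valign[^>]*>(.*)<\/tr>/s';  // .* runs to the last </tr>
$lazy   = '/<tr valign[^>]*>(.*?)<\/tr>/s'; // .*? stops at the first </tr>

preg_match_all($greedy, $table, $greedyMatches);
preg_match_all($lazy, $table, $lazyMatches);

// The greedy pattern swallows both rows in a single match.
echo count($greedyMatches[1]), "\n";

// The lazy pattern yields one entry per row.
print_r($lazyMatches[1]);
```

The greedy version finds a single match spanning both tables, while the lazy version returns the two rows separately.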
Notes:
Adding PREG_PATTERN_ORDER is unnecessary, since it is the default flag of preg_match_all.
DOMDocument is probably a better-suited tool for dealing with HTML. Example:
$dom = new DOMDocument();
@$dom->loadHTML($table);
$trs = $dom->getElementsByTagName('tr');
$results = array();
foreach ($trs as $tr) {
if ($tr->hasAttribute('valign')) {
$children = $tr->childNodes;
$tmp = '';
foreach ($children as $child) {
$tmp .= trim($dom->saveHTML($child));
}
if (!empty($tmp)) $results[] = $tmp;
}
}
echo htmlspecialchars(print_r($results, true));

Using regexes to find result from HTML table

I am stuck on a regular expression problem.
I have a huge HTML file and I need to extract some text (model numbers) from it.
<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>SK1998</b></td></tr>
.... so on
This is a huge page, built entirely from tables with no divs.
The class "thumimages" is repeated on almost every td, so it gives no way to tell the required content apart from the rest of the page.
There are about 10000 model numbers and I need to extract them.
Is there any way to do this with a regex, like
"/<td colspan="2" align="center" class="thumimages"><b>{[1-9]}</b></td></tr>/"
and return an array of all the matched results? Note that I have tried HTML parsing, but the document contains too many HTML validation errors.
Any help would be greatly appreciated.
Description
This will match all td fields with class="thumimages" and retrieve the contents of the inner b tag. The inner text must be non-empty, and any leading or trailing spaces are removed.
<td\b(?=\s)(?=[^>]*\s\bclass=(["'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>
Groups
Group 0 gets the entire td tag from open tag to close tag
Group 1 gets the opening quote around the class value, so that the matching closing quote is also found
Group 2 gets the desired text
PHP Code Example:
Input text
<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr> 
<table>/.....
<td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td></tr>
Code
<?php
$sourcestring="your source string";
preg_match_all('/<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
Matches
$matches Array:
(
[0] => Array
(
[0] => <td colspan="2" align="center" class="thumimages"><b>SK10014</b></td>
[1] => <td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td>
)
[1] => Array
(
[0] => "
[1] => "
)
[2] => Array
(
[0] => SK10014
[1] => SK1998
)
)
Method with DOMDocument:
// $html stands for your html content
$doc = new DOMDocument();
@$doc->loadHTML($html);
$td_nodes = $doc->getElementsByTagName('td');
foreach($td_nodes as $td_node){
if ($td_node->getAttribute('class')=='thumimages')
echo $td_node->firstChild->textContent.'<br/>';
}
Method with regex:
$pattern = <<<'LOD'
~
<td (?>[^>c]++|\bc(?!lass\b))+ # beginning of td tag until the word "class"
class \s*+ = \s*+ # "class=" with variable spaces around the "="
(["']?+) thumimages\b \1 # "thumimages" between quotes or not
(?>[^>]++|(?<!b)>)+> # all characters until the ">" from "<b>"
\s*+ \K # any spaces and pattern reset
[^<\s]++ # all chars that are not a "<" or a space
~xi
LOD;
preg_match_all($pattern, $html, $matches);
echo '<pre>' . print_r($matches[0], true);
/(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(<\/b><\/td><\/tr>)/i
This works; the model number is captured in group 2.
You can use PHP's DOMDocument class:
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('load.html');
$xpath = new DOMXPath($dom);
foreach($xpath->query('//tr') as $tr){
echo $xpath->query('.//td[@class="thumimages"]', $tr)->item(0)->nodeValue.'<br/>';
}
?>
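Since the question mentions that the document has many validation errors, here is a hedged variant of the XPath approach that collects libxml's warnings instead of suppressing them with @. The sample HTML and the $models variable are assumptions for illustration; a real page would be loaded from a file or a string you already have.

```php
<?php
// Illustrative input modelled on the question's markup.
$html = '<table>
<tr><td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
<tr><td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td></tr>
<tr><td class="other"><b>ignored</b></td></tr>
</table>';

// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$models = [];
// Select the <b> inside every td whose class attribute is exactly "thumimages".
foreach ($xpath->query('//td[@class="thumimages"]/b') as $b) {
    $text = trim($b->textContent);
    if ($text !== '') {
        $models[] = $text;
    }
}

print_r($models);
```

Querying the td cells directly, rather than looping over every tr, avoids calling item(0) on rows that have no matching cell.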

My regular expression doesn't know when to stop

I'm trying to match this (the name in particular):
<tr>
<th class="name">Name:</th>
<td>John Smith</td>
</tr>
Like this:
preg_match('/<th class="name">Name:<\/th>.+?<td>(.+)<\/td>/s', $a, $b);
However, while it matches the name, it doesn't stop at the end of the name. It keeps going for another 150 or so characters. Why is this? I only want to match the name.
Make the last quantifier non-greedy: preg_match('/<th class="name">Name:<\/th>.+?<td>(.+?)<\/td>/s', $a, $b);
Don't use regex to parse HTML; it's very easy with DOMDocument:
<?php
$html = <<<HTML
<tr>
<th class="name">Name:</th>
<td>John Smith</td>
</tr>
<tr>
<th class="name">Somthing:</th>
<td>Foobar</td>
</tr>
HTML;
$dom = new DOMDocument();
@$dom->loadHTML($html);
$ret = array();
foreach($dom->getElementsByTagName('tr') as $tr) {
$ret[trim($tr->getElementsByTagName('th')->item(0)->nodeValue,':')] = $tr->getElementsByTagName('td')->item(0)->nodeValue;
}
print_r($ret);
/*
Array
(
[Name] => John Smith
[Somthing] => Foobar
)
*/
?>
preg_match('/<th class="name">Name:<\/th>\s*<td>(.+?)<\/td>/s', $line, $matches);
Match only whitespace between the </th> and <td>, and non-greedy match for the name.
preg_match('/<th class="name">Name:<\/th>.+?<td>(?P<name>.*?)<\/td>/s', $str, $match);
echo $match['name'];
Here is a pattern that matches it:
preg_match('!<tr>\s*<th[^>]*>Name:</th>\s*<td>([^<]*)</td>\s*</tr>!s', $a, $b);
The name is captured in $b[1].
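The negated character class idea can be sketched as follows, assuming a two-row input like the one used in the DOMDocument answer (the $name variable is only illustrative). Because [^<]* cannot cross a tag boundary, no lazy quantifier is needed:

```php
<?php
$a = '<tr>
<th class="name">Name:</th>
<td>John Smith</td>
</tr>
<tr>
<th class="name">Something:</th>
<td>Foobar</td>
</tr>';

// "!" is used as the delimiter so the slashes in closing tags need no escaping.
// \s* allows only whitespace between </th> and <td>; [^<]* stops at the next "<".
$pattern = '!<th class="name">Name:</th>\s*<td>([^<]*)</td>!';

$name = null;
if (preg_match($pattern, $a, $b)) {
    $name = $b[1];
}

echo $name, "\n";
```

This only holds while the cell contains plain text; a nested tag inside the td would make the match fail rather than overrun.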

Can I use conditional logic in PHP EOD?

Is it possible to put conditional logic inside an EOD string?
$str = <<<EOD
<table>
<tr>
<td>
if ( !empty($var1) ) {
{$var1}
} else {
{$var2}
}
</td>
</tr>
</table>
This doesn't work for me, and it sort of looks like it wouldn't work, but I thought I'd take a stab.
Also, is it EOD or EOT? Both seem to work.
No. You cannot use conditionals in heredoc.
Also, is it EOD or EOT?
As long as your beginning and ending strings match you can use anything:
$x = <<<THOMAS
Pick a string, any string
THOMAS;
The documentation contains several examples demonstrating this.
As to how best to achieve the example you provided, this would be my first inclination:
$td = !empty($var1) ? $var1 : $var2;
$str = <<<EOD
<table>
<tr>
<td>
{$td}
</td>
</tr>
</table>
EOD;
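If you would rather keep the decision inside the heredoc, one possible workaround (a sketch, not the only option) is to interpolate a call to a closure stored in a variable. The $pick helper and the sample values below are hypothetical:

```php
<?php
$var1 = '';
$var2 = 'fallback';

// Hypothetical helper: returns $a when it is non-empty, otherwise $b.
$pick = function ($a, $b) {
    return !empty($a) ? $a : $b;
};

// The {$...} syntax can call a closure held in a variable.
$str = <<<EOD
<table>
<tr>
<td>
{$pick($var1, $var2)}
</td>
</tr>
</table>
EOD;

echo $str;
```

The conditional itself still lives outside the string; the heredoc only interpolates the result, so computing the value beforehand as shown above is usually easier to read.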
