My regular expression doesn't know when to stop

I'm trying to match this (the name in particular):
<tr>
<th class="name">Name:</th>
<td>John Smith</td>
</tr>
Like this:
preg_match('/<th class="name">Name:<\/th>.+?<td>(.+)<\/td>/s', $a, $b);
However, while it matches the name, it doesn't stop at the end of the name. It keeps going for another 150 or so characters. Why is this? I only want to match the name.

Make last quantifier non-greedy: preg_match('/<th class="name">Name:<\/th>.+?<td>(.+?)<\/td>/s', $a, $b);
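To see the difference, here is a minimal sketch that runs the question's pattern and the corrected one side by side. The second table row is an assumption added for illustration; it stands in for whatever follows the name in the real page.

```php
<?php
// Hypothetical input: the row from the question followed by a second row.
$a = <<<HTML
<tr>
<th class="name">Name:</th>
<td>John Smith</td>
</tr>
<tr>
<th class="name">City:</th>
<td>Springfield</td>
</tr>
HTML;

// Greedy (.+) runs on to the LAST </td> in the input.
preg_match('/<th class="name">Name:<\/th>.+?<td>(.+)<\/td>/s', $a, $greedy);

// Lazy (.+?) stops at the FIRST </td> after the opening <td>.
preg_match('/<th class="name">Name:<\/th>.+?<td>(.+?)<\/td>/s', $a, $lazy);

var_dump($greedy[1] === 'John Smith'); // bool(false): the capture spills into the next row
var_dump($lazy[1]);                    // string(10) "John Smith"
```

With the s modifier the dot also matches newlines, which is why the greedy version can run across rows at all.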

Don't use regex to parse HTML. It's very easy with DOMDocument:
<?php
$html = <<<HTML
<tr>
<th class="name">Name:</th>
<td>John Smith</td>
</tr>
<tr>
<th class="name">Somthing:</th>
<td>Foobar</td>
</tr>
HTML;
$dom = new DOMDocument();
@$dom->loadHTML($html); // "@" suppresses parser warnings about the HTML fragment
$ret = array();
foreach($dom->getElementsByTagName('tr') as $tr) {
$ret[trim($tr->getElementsByTagName('th')->item(0)->nodeValue,':')] = $tr->getElementsByTagName('td')->item(0)->nodeValue;
}
print_r($ret);
/*
Array
(
[Name] => John Smith
[Somthing] => Foobar
)
*/
?>

preg_match('/<th class="name">Name:<\/th>\s*<td>(.+?)<\/td>/s', $line, $matches);
Match only whitespace between the </th> and <td>, and non-greedy match for the name.

preg_match('/<th class="name">Name:<\/th>.+?<td>(?P<name>.*?)<\/td>/s', $str, $match);
echo $match['name'];

Here is your match:
preg_match('!<tr>\s*<th[^>]*>Name:</th>\s*<td>([^<]*)</td>\s*</tr>!s', $a, $b);
It stops at the end of the name because [^<]* cannot match past the start of the next tag.
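As a rough sketch of why the negated character class is safe even though it is greedy, the snippet below runs that pattern against two rows. The second row and the variable names are assumptions made for the example.

```php
<?php
// Hypothetical input: the question's row plus one more row.
$html = "<tr>\n<th class=\"name\">Name:</th>\n<td>John Smith</td>\n</tr>\n"
      . "<tr>\n<th class=\"name\">City:</th>\n<td>Springfield</td>\n</tr>";

// "!" is the pattern delimiter, so the "/" in closing tags needs no escaping.
// [^<]* cannot cross a "<", so the capture ends at the cell's closing tag
// even though the quantifier is greedy.
$pattern = '!<tr>\s*<th[^>]*>Name:</th>\s*<td>([^<]*)</td>\s*</tr>!';

if (preg_match($pattern, $html, $m)) {
    echo $m[1], "\n"; // John Smith
}
```

The trade-off is that this form only works when the cell contains plain text; a cell with nested tags would not match.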

Related

Regular expression - PHP Preg Match

I am learning to use Regular expressions and would like to grab some data from a table:
The file looks like this:
$subject =
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
</tbody>
Currently I am doing the following:
$pattern = "/<tr>.*?<td><\/td>.*?<td>(.*?)<\/td>.../s";
preg_match(
$pattern,
$subject,
$result);
This will output an array:
$result = [
0 => "tbody>...",
1 => 1,
2 => 2,
3 => 3,
4 => 4 ... n
]
This seems inefficient so I am attempting to grab a repeated pattern like so:
$pattern = "/<td>([0-9]{1,2})<\/td>/s";
This however only grabs the first number: 1
What would be the best way to go about this?

You should use preg_match_all instead of preg_match: preg_match stops after the first match, while preg_match_all searches the entire string.
http://php.net/manual/en/function.preg-match-all.php
if (preg_match_all( $pattern, $subject, $matches)) {
var_dump($matches);
}
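For illustration, a minimal sketch using the table and the second pattern from the question shows how the results are laid out:

```php
<?php
$subject = '
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
</tbody>';

$pattern = '/<td>([0-9]{1,2})<\/td>/';

// preg_match_all returns the number of full matches (0 if none, false on error).
$count = preg_match_all($pattern, $subject, $matches);

// $matches[0] holds the full matches, $matches[1] the first capture group.
echo $count, "\n";                    // 6
echo implode(',', $matches[1]), "\n"; // 1,2,3,4,5,6
```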

Here's a way to accomplish this using a parser:
$subject = '
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
</tbody>';
$html = new DOMDocument();
$html->loadHTML($subject);
$tds = $html->getElementsByTagName('td');
foreach($tds as $td){
echo $td->nodeValue . "\n";
if(is_numeric($td->nodeValue)) {
echo "it's a number \n";
}
}
Output:
1
it's a number
2
it's a number
3
it's a number
4
it's a number
5
it's a number
6
it's a number

To get all the values rather than stopping after the first match, you need a global search (the g flag in other regex flavors).
PHP has no g flag; the preg_match_all function does this instead.
Since the data will always be contained in a td, you can do the following:
preg_match_all("/<td>(.*?)<\/td>/", $subject, $matches);
var_dump($matches);
Here $subject contains your HTML, and you should see an array of all your table data.

Get all matches in a preg_match request

I'm having the following problem. I have this structure:
$table = '
<table>
<tbody>
<tr valign="top">
<td>foo</td>
<td>bar</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr valign="top">
<td>bee</td>
<td>dog</td>
</tr>
</tbody>
</table>';
I'm trying to retrieve an array with the contents of every <tr>, but with no success. The closest pattern I could come up with returns a mess:
$pattern = "/<tr valign[^>]*>(.*)<\/tr>/s";
preg_match_all($pattern, $table, $matches, PREG_PATTERN_ORDER);
If I call var_dump($matches), I want an array like this:
array(
[0] => "<td>foo</td><td>bar</td>",
[1] => "<td>bee</td><td>dog</td>"
);
...or something close to that.
But I receive:
string(301) "
foo
bar
"
<table>
<tbody>
<tr valign="top">
<td>bee</td>
<td>dog</td>
</tr>
</tbody></table>
Anyone know what I'm doing wrong?
Thanks in advance.

You must make your quantifier lazy: .* => .*?
With a greedy quantifier, .* takes as many characters as possible; with a lazy quantifier, .*? takes as few as possible.
With a lazy quantifier, the regex engine takes characters one at a time and tests after each one whether the rest of the pattern can match.
With a greedy quantifier (the default behavior), the regex engine takes all possible characters (up to the end of the string in your case) and then backtracks character by character until the rest of the pattern matches.
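A small sketch of that behavior, using a compacted version of the two tables from the question (the whitespace has been removed only to keep the example short):

```php
<?php
$table = '<table><tr valign="top"><td>foo</td><td>bar</td></tr></table>'
       . '<table><tr valign="top"><td>bee</td><td>dog</td></tr></table>';

// Greedy: one match that runs from the first <tr> to the last </tr>.
preg_match_all('/<tr valign[^>]*>(.*)<\/tr>/s', $table, $greedy);

// Lazy: one match per row.
preg_match_all('/<tr valign[^>]*>(.*?)<\/tr>/s', $table, $lazy);

echo count($greedy[1]), "\n"; // 1
echo count($lazy[1]), "\n";   // 2
print_r($lazy[1]);            // the two <td>...</td> pairs, one entry per row
```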
Notes:
It is unnecessary to pass PREG_PATTERN_ORDER, since it is the default flag for preg_match_all.
DOMDocument is probably a better-suited tool for dealing with HTML. Example:
$dom = new DOMDocument();
@$dom->loadHTML($table);
$trs = $dom->getElementsByTagName('tr');
$results = array();
foreach ($trs as $tr) {
if ($tr->hasAttribute('valign')) {
$children = $tr->childNodes;
$tmp = '';
foreach ($children as $child) {
$tmp .= trim($dom->saveHTML($child));
}
if (!empty($tmp)) $results[] = $tmp;
}
}
echo htmlspecialchars(print_r($results, true));

Regular expression to extract the onclick value from a string

Hi, I am trying to get an exact value out of a JavaScript onclick attribute.
Here is my example link:
onclick="omniture('Touchpad_8.0.7.2.ZIP','NP-N150P');downloadFile('http://xxx.com/downloadfile/ContentsFile.aspx?CDSite=UNI_CO&CttFileID=3017288&CDCttType=DR&ModelType=N&ModelName=NP-N150P&VPath=DR/201105/20110509115437867/Touchpad_8.0.7.2.ZIP','ZIP');return false;"
Lan o red inalambrica BROADCOM - 5.100.82.95 - onclick="omniture('WLAN_Broadcom_5.100.82.95.ZIP','NP-N150P');downloadFile('http://xxx.com/downloadfile/ContentsFile.aspx?CDSite=UNI_CO&CttFileID=3017290&CDCttType=DR&ModelType=N&ModelName=NP-N150P&VPath=DR/201108/20110817201634927/WLAN_Broadcom_5.100.82.95.ZIP','ZIP');return false;"
here is what I am trying:
preg_match_all(
"~onclick\s*=\s*([\"\'])(.*?)\\1~si", $d_l, $match);
$link = $match[0][0];
I am getting the full onclick attribute, not the exact value. I want to get the link as output:
(
http://xxx.com/downloadfile/ContentsFile.aspx?CDSite=UNI_CO&CttFileID=3017290&CDCttType=DR&ModelType=N&ModelName=NP-N150P&VPath=DR/201108/20110817201634927/WLAN_Broadcom_5.100.82.95.ZIP)
Can anyone help, please?

An example on how you can do this properly:
<pre><?php
$html = <<<LOD
<html><head></head><body>
<table>
<thead></thead>
<tbody id="tbodyDR">
<tr><td>bidule
bidule
</td></tr>
<tr><td>truc
truc
</td></tr>
<tr><td>bidule
machin
</td></tr>
</tbody>
</body></html>
LOD;
$doc = new DOMDocument();
//@$doc->loadHTMLFile('http://example.com/list.html');
@$doc->loadHTML($html);
$links = $doc->getElementById('tbodyDR')->getElementsByTagName("a");
$result = array();
foreach($links as $link) {
$onclickAttr = $link->getAttribute('onclick');
// \K drops "downloadFile('" from the match, leaving only the URL
if( preg_match("~downloadFile\('\K[^']++~", $onclickAttr, $match) )
$result[] = $match[0];
}
print_r($result);

$match[0][$i-1] is the whole $i-th match, $match[1][$i-1] corresponds to the first submatch in the $i-th match, etc.
To get just the links, try this:
preg_match_all(
"~onclick\s*=\s*([\"\']).*?downloadFile\(([\"'])(.*?)\\2.*?\).*?\\1~si",
$d_l, $match
);
foreach ($match[3] as $link)
echo $link, "<br>\n";
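The indexing described above can be sketched as follows. The onclick values are shortened, hypothetical stand-ins for the ones in the question, and the pattern is a simplified one that only looks for the first argument of downloadFile.

```php
<?php
// Shortened, hypothetical onclick values in the same shape as the question's.
$d_l = '<a onclick="omniture(\'a.zip\',\'X\');downloadFile(\'http://example.com/a.zip\',\'ZIP\');return false;">A</a>'
     . '<a onclick="omniture(\'b.zip\',\'X\');downloadFile(\'http://example.com/b.zip\',\'ZIP\');return false;">B</a>';

// Capture everything between the quotes of downloadFile's first argument.
$pattern = "~downloadFile\\('([^']+)'~";

// Default PREG_PATTERN_ORDER: $m[group][match]
preg_match_all($pattern, $d_l, $m);
echo $m[1][0], "\n"; // http://example.com/a.zip
echo $m[1][1], "\n"; // http://example.com/b.zip

// PREG_SET_ORDER: $sets[match][group]
preg_match_all($pattern, $d_l, $sets, PREG_SET_ORDER);
foreach ($sets as $set) {
    echo $set[1], "\n";
}
```

PREG_SET_ORDER is often easier to loop over when each match has several groups, since every iteration gets one complete match.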

Using regexes to find result from HTML table

I am stuck with some regular expression problem.
I have a huge file in html and i need to extract some text (Model No.) from the file.
<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>SK1998</b></td></tr>
.... so on
and this is a huge page with all webpage built in table and divless...
The class "thumimages" almost repeats in all td, so leaves no way to differentiate the require content from the page.
There are about 10000 model No and i need to extract them.
Is there any way to do this with regex... like
"/<td colspan="2" align="center" class="thumimages"><b>{[1-9]}</b></td></tr>/"
and return an array of all the matched results. Note that I have tried HTML parsing, but the document contains too many HTML validation errors.
any help would be greatly appreciated...

Description
This will match all td fields with class="thumimages" and retrieve the contents of the inner b tag. The inner text must be non-empty, and any leading or trailing spaces will be removed.
<td\b(?=\s)(?=[^>]*\s\bclass=(["'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>
Groups
Group 0 gets the entire td tag from open tag to close tag
Group 1 gets the open quote around the class value to ensure the correct closing quote is also found
Group 2 gets the desired text
PHP Code Example:
Input text
<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr> 
<table>/.....
<td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td></tr>
Code
<?php
$sourcestring="your source string";
preg_match_all('/<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
Matches
$matches Array:
(
[0] => Array
(
[0] => <td colspan="2" align="center" class="thumimages"><b>SK10014</b></td>
[1] => <td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td>
)
[1] => Array
(
[0] => "
[1] => "
)
[2] => Array
(
[0] => SK10014
[1] => SK1998
)
)

Method with DOMDocument:
// $html stands for your html content
$doc = new DOMDocument();
@$doc->loadHTML($html);
$td_nodes = $doc->getElementsByTagName('td');
foreach($td_nodes as $td_node){
if ($td_node->getAttribute('class')=='thumimages')
echo $td_node->firstChild->textContent.'<br/>';
}
Method with regex:
$pattern = <<<'LOD'
~
<td (?>[^>c]++|\bc(?!lass\b))+ # beginning of td tag until the word "class"
class \s*+ = \s*+ # "class=" with variable spaces around the "="
(["']?+) thumimages\b \1 # "thumimages" between quotes or not
(?>[^>]++|(?<!b)>)+> # all characters until the ">" from "<b>"
\s*+ \K # any spaces and pattern reset
[^<\s]++ # all chars that are not a "<" or a space
~xi
LOD;
preg_match_all($pattern, $html, $matches);
echo '<pre>' . print_r($matches[0], true);

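~(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(</b></td></tr>)~i
This works; the model number is in group 2. The ~ delimiter is used so that the slashes in the closing tags need no escaping.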

You can use PHP's DOMDocument class together with DOMXPath:
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('load.html');
$xpath = new DOMXPath($dom);
foreach($xpath->query('//tr') as $tr){
$td = $xpath->query('.//td[@class="thumimages"]', $tr)->item(0);
if ($td !== null) echo $td->nodeValue.'<br/>'; // skip rows without a matching cell
}
?>
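Since the question mentions that the page has many validation errors, here is a hedged sketch of the same idea that collects libxml's parse warnings instead of suppressing them with @. The sample markup is hypothetical (it leaves the table unclosed on purpose), and how well the parser recovers will depend on how badly the real page is broken.

```php
<?php
// Hypothetical, deliberately incomplete markup: the <table> is never closed.
$html = '<table><tr>
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
<tr><td class="other"><b>ignore me</b></td></tr>
<tr><td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td></tr>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();            // discard the collected warnings

$xpath = new DOMXPath($dom);
$models = array();
// Select the <b> directly inside every td whose class is exactly "thumimages".
foreach ($xpath->query('//td[@class="thumimages"]/b') as $b) {
    $text = trim($b->textContent);
    if ($text !== '') {
        $models[] = $text;
    }
}
print_r($models);
```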

Extract content from each first TD in a Table

I've got some HTML that looks like this:
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-even">
<td align="center">ijkl</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
And I need to retrieve the values, abcde, efgh, and ijkl
This is the regex I'm currently using:
preg_match_all('/(<tr class="row-even">|<tr class="row-odd">)<td align="center">(.*)<\/td><\/tr>/xs', $html, $matches);
Yes, I'm not very good at them. As with most of my regex attempts, this is not working. Can anyone tell me why?
Also, I know about html/xml parsers, but it would require a significant code revisit to make that happen. So that's for later. We need to stick with regex for now.
EDIT: To clarify, I need the values between the first <td align="center"></td> tag after either <tr class="row-even"> or <tr class="row-odd">

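~<tr class="row-(even|odd)">\s*<td align="center">(.*?)</td>~
Notice the use of \s* to skip the newline between the tags, and the lazy .*? so the match stops at the first </td>. No m modifier is needed: it only changes how ^ and $ behave, and this pattern uses neither.
Also, you can make the first group non-capturing via ?:, i.e. (?:even|odd), as you're probably not interested in the class attribute :)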
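A short sketch of the non-capturing variant, run against two of the rows from the question:

```php
<?php
$html = '<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>';

// (?:even|odd) groups the alternation without capturing it,
// so the cell text is capture group 1.
preg_match_all('~<tr class="row-(?:even|odd)">\s*<td align="center">(.*?)</td>~', $html, $matches);

print_r($matches[1]); // abcde, efgh
```

With the capturing form (even|odd), the same values would end up in $matches[2] instead.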

Try this:
preg_match_all('/(?:<tr class="row-even">|<tr class="row-odd">).<td align="center">(.*?)<\/td>/s', $html, $matches);
Changes made:
You had not accounted for the newline between the tags.
You don't need the x modifier, as it makes the engine ignore the literal spaces in the regex.
Make the matching non-greedy by using .*? in place of .*.

Actually, you don't need too big a change in your codebase. Fetching text nodes is always the same with DOM and XPath. All that changes is the XPath query, so you could wrap the DOM code in a function that replaces your preg_match_all. That would be just a tiny change, e.g.
include_once "dom.php";
$matches = dom_match_all('//tr/td[1]', $html);
where dom.php just contains:
// dom.php
function dom_match_all($query, $html, array $matches = array()) {
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTML($html);
libxml_clear_errors();
$xPath = new DOMXPath($dom);
foreach( $xPath->query($query) as $node ) {
$matches[] = $node->nodeValue;
}
return $matches;
}
and would return
Array
(
[0] => abcde
[1] => efgh
[2] => ijkl
)
But if you want a Regex, use a Regex. I am just giving ideas.

This is what I came up with
<td align="center">([^<]+)</td>
I'll explain. One of the challenges here is that what's between the <td> tags could be either the text you're looking for or an <img> tag. In the regex, [^<]+ says to match one or more characters that are not the < character. That's great, because it means the <img> cell won't match, and the group will only match up to the closing </td> tag.
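As a quick check of that reasoning, the sketch below applies the pattern to one row from the question; the cell holding the image is skipped because nothing but a "<" follows its opening tag.

```php
<?php
$html = '<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>';

// [^<]+ needs at least one non-"<" character right after the opening tag,
// so the cell that starts with <img ...> cannot match.
preg_match_all('~<td align="center">([^<]+)</td>~', $html, $matches);

print_r($matches[1]); // only "abcde"
```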

Disclaimer: Using regexps to parse HTML is dangerous.
To get the innerhtml of the first TD in each TR, use this regexp:
/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si

This is just a quick and dirty regex to meet your needs. It could easily be cleaned up and optimized, but it's a start.
<tr[^>]+>[^\n]*\n #Match the opening <tr> tag
\s*<td[^>]+>([^<]+)[^\n]+\n #Group the wanted data
[^\n]+\n #Match next line
</tr> #Match closing tag
Here is an alternative way, which may be more robust:
deluserconfirm.html\?user=([^"]+)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
