Extract content from each first TD in a Table

I've got some HTML that looks like this:
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-even">
<td align="center">ijkl</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
And I need to retrieve the values, abcde, efgh, and ijkl
This is the regex I'm currently using:
preg_match_all('/(<tr class="row-even">|<tr class="row-odd">)<td align="center">(.*)<\/td><\/tr>/xs', $html, $matches);
Yes, I'm not very good at them. As with most of my regex attempts, this is not working. Can anyone tell me why?
Also, I know about html/xml parsers, but it would require a significant code revisit to make that happen. So that's for later. We need to stick with regex for now.
EDIT: To clarify, I need the values between the first <td align="center"></td> tag after either <tr class="row-even"> or <tr class="row-odd">

~<tr class="row-(even|odd)">\s*<td align="center">(.*?)</td>~m
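Notice the use of \s* to skip the whitespace between the tags, and the lazy .*? that stops at the first </td>. (The m modifier only changes how ^ and $ behave, so it is harmless but not needed here.)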
Also, you can make the first group non-capturing via ?:. I.e., (?:even|odd) as you're probably not interested in the class attribute :)
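A minimal sketch of that pattern with the non-capturing group applied, run against the rows from the question. The variable names ($html, $firstCells) are only illustrative, and the m modifier is left out because the pattern has no ^ or $ anchors.

```php
<?php
// Sample rows copied from the question.
$html = <<<'HTML'
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-even">
<td align="center">ijkl</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
HTML;

// (?:even|odd) does not capture, so the cell text ends up in group 1.
// \s* skips the newline after <tr ...>; .*? stops at the first </td>.
preg_match_all(
    '~<tr class="row-(?:even|odd)">\s*<td align="center">(.*?)</td>~',
    $html,
    $matches
);

$firstCells = $matches[1];
print_r($firstCells);
```

Because the pattern anchors on the <tr ...> tag, only the first cell of each row is picked up; the second cell is never preceded directly by a <tr>.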

Try this:
preg_match_all('/(?:<tr class="row-even">|<tr class="row-odd">).<td align="center">(.*?)<\/td>/s', $html, $matches);
Changes made:
You hadn't accounted for the newline between the tags; with the s modifier, the . matches it (a single character).
You don't need the x modifier, as it makes the regex ignore the literal spaces in your pattern.
The matching is now non-greedy: .*? in place of .*.

Actually, you don't need too big a change in your codebase. Fetching text nodes is always the same with DOM and XPath; all that changes is the XPath expression, so you could wrap the DOM code in a function that replaces your preg_match_all. That would be just a tiny change, e.g.
include_once "dom.php";
$matches = dom_match_all('//tr/td[1]', $html);
where dom.php just contains:
// dom.php
function dom_match_all($query, $html, array $matches = array()) {
    $dom = new DOMDocument;
    libxml_use_internal_errors(TRUE);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xPath = new DOMXPath($dom);
    foreach( $xPath->query($query) as $node ) {
        $matches[] = $node->nodeValue;
    }
    return $matches;
}
and would return
Array
(
[0] => abcde
[1] => efgh
[2] => ijkl
)
But if you want a Regex, use a Regex. I am just giving ideas.

This is what I came up with
<td align="center">([^<]+)</td>
I'll explain. One of the challenges here is that what's between the tags could be either the text you're looking for or an <img> tag. In the regex, [^<]+ matches one or more characters that are not the < character. That's great, because it means the cells containing <img> won't match, and the group will only match up to the closing </td> tag.
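As a rough sketch of how that behaves on the markup from the question (the ~ delimiters and the variable names are my own choice, not part of the answer):

```php
<?php
// Two rows from the question: one text cell and one <img> cell per row.
$html = <<<'HTML'
<tr class="row-even">
<td align="center">abcde</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
<tr class="row-odd">
<td align="center">efgh</td>
<td align="center"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></td>
</tr>
HTML;

// [^<]+ needs at least one non-"<" character right after the opening tag,
// so a cell that starts with <img ...> cannot match.
preg_match_all('~<td align="center">([^<]+)</td>~', $html, $matches);

$textCells = $matches[1];
print_r($textCells);
```

Note that this matches every centered cell holding plain text, not specifically the first cell of a row, so it relies on the other cells containing markup.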

Disclaimer: Using regexps to parse HTML is dangerous.
To get the innerhtml of the first TD in each TR, use this regexp:
/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si

This is just a quick and dirty regex to meet your needs. It could easily be cleaned up and optimized, but it's a start.
<tr[^>]+>[^\n]*\n #Match the opening <tr> tag
\s*<td[^>]+>([^<]+)[^\n]+\n #Group the wanted data
[^\n]+\n #Match next line
</tr> #Match closing tag
Here is an alternative way, which may be more robust:
deluserconfirm.html\?user=([^"]+)

Related

replacing a string and keeping numbers intact

I've got this table generated with PHP.
A function generates a string with all the HTML code:
<table><tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td><td>7</td><td>8</td><td>9</td><td>10</td></tr><tr><td>2</td><td>4</td><td>6</td><td>8</td><td>10</td><td>12</td><td>14</td><td>16</td><td>18</td><td>20</td></tr><tr><td>3</td><td>6</td><td>9</td><td>12</td><td>15</td><td>18</td><td>21</td><td>24</td><td>27</td><td>30</td></tr><tr><td>4</td><td>8</td><td>12</td><td>16</td> .... </table>
Now I want to make the numbers 1 to 10 bold. I'm trying to replace '<td>(10|[0-9])</td>' with <td style="font-weight: bold">THE-ORIGINAL-NUMBER</td>.
Thanks in advance!
P.S. I know there are a lot of similar answers out there, but I just couldn't figure it out. Is there a genuinely noob-friendly tutorial or glossary of regex out there? I couldn't really find an up-to-date site.

If you are matching this regular expression:
<td>(10|[0-9])</td>
You are capturing whatever 10|[0-9] matches into capture group #1. This can be referenced in your replacement with either of the following backreferences:
\1
$1
Full PHP code:
$html = '<td>1</td>';
$html = preg_replace(
'~<td>(10|[0-9])</td>~',
'<td style="font-weight: bold">\1</td>',
$html
);

Use this regex:
(?<=<td>)(10|[0-9])(?=<\/td>)
Replace group #1 (here the whole match, since the <td> tags are only looked at, not consumed) with:
<span class="BoldText">$1</span>
Style:
.BoldText {
font-weight: bold;
}
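A small sketch of how the lookaround version could be wired into preg_replace. The sample row and the variable names are made up for illustration; the pattern and replacement are the ones given above, written with ~ delimiters so the slash needs no escaping.

```php
<?php
// Hypothetical sample row: 7 and 10 should be wrapped, 12 should not.
$html = '<tr><td>7</td><td>10</td><td>12</td></tr>';

// The lookbehind and lookahead only assert that <td> and </td> surround
// the number; they are not part of the match, so the tags stay untouched.
$bolded = preg_replace(
    '~(?<=<td>)(10|[0-9])(?=</td>)~',
    '<span class="BoldText">$1</span>',
    $html
);

echo $bolded, "\n";
```

Keep in mind that [0-9] also matches 0, so a cell containing just 0 would be wrapped as well.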

using <b> may be useful:
replace
'~<td>(10|[0-9])</td>~'
with
'<td><b>\1</b></td>'

Get all matches in a preg_match request

I'm having the following problem. I have this structure:
$table = '
<table>
<tbody>
<tr valign="top">
<td>foo</td>
<td>bar</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr valign="top">
<td>bee</td>
<td>dog</td>
</tr>
</tbody>
</table>';
I'm trying to retrieve an array with the contents of every <tr>, but with no success. The closest pattern I could come up with returns everything jumbled together.
$pattern = "/<tr valign[^>]*>(.*)<\/tr>/s";
preg_match_all($pattern, $table, $matches, PREG_PATTERN_ORDER);
If I call var_dump($matches), I want to see an array like this:
array(
[0] => "<td>foo</td><td>bar</td>",
[1] => "<td>bee</td><td>dog</td>"
);
...or something close to that.
But I receive:
string(301) "
foo
bar
"
<table>
<tbody>
<tr valign="top">
<td>bee</td>
<td>dog</td>
</tr>
</tbody></table>
Anyone know what I'm doing wrong?
Thanks in advance.

You must make your quantifier lazy: .* => .*?
A greedy quantifier (.*) takes as many characters as possible; a lazy quantifier (.*?) takes as few as possible.
With a lazy quantifier, the regex engine takes characters one at a time and tests after each one whether the rest of the pattern can match.
With a greedy quantifier (the default behavior), the regex engine takes all possible characters (up to the end of the string in your case) and then backtracks character by character until the rest of the pattern matches.
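To make the difference concrete, here is a small sketch that runs both versions of the pattern on a condensed copy of the question's $table. The whitespace cleanup at the end is my own addition to get close to the array the question asks for.

```php
<?php
// Condensed copy of the two tables from the question.
$table = '
<table><tbody>
<tr valign="top">
<td>foo</td>
<td>bar</td>
</tr>
</tbody></table>
<table><tbody>
<tr valign="top">
<td>bee</td>
<td>dog</td>
</tr>
</tbody></table>';

// Greedy: .* runs to the end of the string and backtracks to the LAST </tr>.
preg_match_all('/<tr valign[^>]*>(.*)<\/tr>/s', $table, $greedy);

// Lazy: .*? stops at the FIRST </tr> after each opening tag.
preg_match_all('/<tr valign[^>]*>(.*?)<\/tr>/s', $table, $lazy);

echo count($greedy[1]), "\n"; // number of greedy matches
echo count($lazy[1]), "\n";   // number of lazy matches

// Drop the whitespace between tags and around each row.
$rows = array_map('trim', preg_replace('/>\s+</', '><', $lazy[1]));
print_r($rows);
```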
Notes:
There is no need to pass PREG_PATTERN_ORDER, since it is the default for preg_match_all.
DOMDocument is probably a better-suited tool for dealing with HTML. Example:
$dom = new DOMDocument();
// @ suppresses the warnings libxml raises on imperfect markup
@$dom->loadHTML($table);
$trs = $dom->getElementsByTagName('tr');
$results = array();
foreach ($trs as $tr) {
    if ($tr->hasAttribute('valign')) {
        $children = $tr->childNodes;
        $tmp = '';
        foreach ($children as $child) {
            $tmp .= trim($dom->saveHTML($child));
        }
        if (!empty($tmp)) $results[] = $tmp;
    }
}
echo htmlspecialchars(print_r($results, true));

Using regexes to find result from HTML table

I am stuck on a regular expression problem.
I have a huge HTML file and I need to extract some text (model numbers) from it.
<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>SK1998</b></td></tr>
.... so on
The page is huge, and the whole thing is built with tables, with no divs.
The class "thumimages" is repeated on almost every td, so it offers no way to pick out the required content.
There are about 10,000 model numbers and I need to extract them.
Is there any way to do this with regex, like
"/<td colspan="2" align="center" class="thumimages"><b>{[1-9]}</b></td></tr>/"
and return an array of all the matched results? Note that I have tried HTML parsing, but the document contains too many HTML validation errors.
Any help would be greatly appreciated.

Description
This will match all td fields with class="thumimages" and retrieve the contents of the inner b tag. The inner text needs to have some value, and any leading or trailing spaces will be removed.
<td\b(?=\s)(?=[^>]*\s\bclass=(["'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>
Groups
Group 0 gets the entire td tag from open tag to close tag
Group 1 gets the open quote around the class value to ensure the correct closing quote is also found
Group 2 gets the desired text
PHP Code Example:
Input text
<table>......
<td colspan="2" align="center" class="thumimages"><b>SK10014</b></td></tr>
.......
<table>/.....
<td colspan="2" align="center" class="thumimages"><b>     </b></td></tr> 
<table>/.....
<td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td></tr>
Code
<?php
$sourcestring="your source string";
preg_match_all('/<td\b(?=\s)(?=[^>]*\s\bclass=(["\'])thumimages\1)[^>]*><b>\s*(?!<)([^<\s]+)\s*<\/b><\/td>/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
Matches
$matches Array:
(
[0] => Array
(
[0] => <td colspan="2" align="center" class="thumimages"><b>SK10014</b></td>
[1] => <td colspan="2" align="center" class="thumimages"><b> SK1998 </b></td>
)
[1] => Array
(
[0] => "
[1] => "
)
[2] => Array
(
[0] => SK10014
[1] => SK1998
)
)

Method with DOMDocument:
// $html stands for your html content
$doc = new DOMDocument();
// @ suppresses the warnings libxml raises on invalid markup
@$doc->loadHTML($html);
$td_nodes = $doc->getElementsByTagName('td');
foreach($td_nodes as $td_node){
    if ($td_node->getAttribute('class')=='thumimages')
        echo $td_node->firstChild->textContent.'<br/>';
}
Method with regex:
$pattern = <<<'LOD'
~
<td (?>[^>c]++|\bc(?!lass\b))+ # beginning of td tag until the word "class"
class \s*+ = \s*+ # "class=" with variable spaces around the "="
(["']?+) thumimages\b \1 # "thumimages" between quotes or not
(?>[^>]++|(?<!b)>)+> # all characters until the ">" from "<b>"
\s*+ \K # any spaces and pattern reset
[^<\s]++ # all chars that are not a "<" or a space
~xi
LOD;
preg_match_all($pattern, $html, $matches);
echo '<pre>' . print_r($matches[0], true);

/(<td colspan="2" align="center" class="thumimages"><b>)([a-z0-9]+)(<\/b><\/td><\/tr>)/i
This works.

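You can use the PHP DOMDocument class:
<?php
$dom = new DOMDocument();
// @ suppresses the warnings libxml raises on invalid markup
@$dom->loadHTMLFile('load.html');
$xpath = new DOMXPath($dom);
foreach($xpath->query('//tr') as $tr){
    // first td with class "thumimages" in this row, if any
    $td = $xpath->query('.//td[@class="thumimages"]', $tr)->item(0);
    if ($td !== null) {
        echo $td->nodeValue.'<br/>';
    }
}
?>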

PHP web scraping

I'm doing web scraping with PHP, and I want to get the price (3.65) for Sunday from the HTML code below:
<tr class="odd">
<td >
<b>Sunday</b> Info
<div class="test">test</div>
</td>
<td>
€ 3.65 *
</td>
</tr>
But I can't find the right regex to do this.
I use this PHP code:
<?php
$data = file_get_contents('http://www.test.com/');
preg_match('/<tr class="odd"><td ><b>Sunday</b> Info<div class="test">test<\/div><\/td><td>€ (.*) *<\/td><\/tr>/i', $data, $matches);
$result = $matches[1];
?>
But no result... What's wrong in the regex? (I think it's because of the new lines/spaces?)

Don't use regular expressions, HTML is not regular.
Instead, use a DOM Tree parser like DOMDocument. This documentation may help you.
The /s switch should help you with your original regex though I haven't tried it.

The problem is the whitespace between the tags.
There are line breaks, tabs and/or spaces,
and your regex doesn't match them.
You need \s* between the tags (the m modifier would not help, since it only changes how ^ and $ behave).
I think it is easier to use XPath for scraping.
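A rough sketch of the XPath route, using the snippet from the question wrapped in a <table> instead of a live page. The XPath expression and the final digit-extracting regex are my own assumptions about what "the price for Sunday" means here, not something the answer spelled out.

```php
<?php
// The row from the question, wrapped in a table so the parser keeps it.
$data = <<<'HTML'
<table>
<tr class="odd">
<td >
<b>Sunday</b> Info
<div class="test">test</div>
</td>
<td>
€ 3.65 *
</td>
</tr>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Second cell of the "odd" row whose first cell has <b>Sunday</b>.
$cell = $xpath
    ->query('//tr[@class="odd"][td/b[normalize-space()="Sunday"]]/td[2]')
    ->item(0);

// Pull just the number out of the cell text ("€ 3.65 *" plus newlines).
$price = null;
if ($cell !== null && preg_match('/\d+(?:\.\d+)?/', $cell->nodeValue, $m)) {
    $price = $m[0];
}

echo $price, "\n";
```

Whitespace between the tags does not matter here, because the parser builds a tree before the query runs.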

Try to replace newlines with '' and then perform the regexp again.
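Something along these lines, as a sketch only. It assumes the markup is exactly as posted, with no indentation before the tags, so that removing the line breaks yields the string the original regex expected. The pattern also differs from the question's in three ways: ~ delimiters (so the slashes need no escaping), a lazy .*?, and an escaped \* for the literal asterisk.

```php
<?php
// Markup as posted in the question (no leading indentation assumed).
$data = <<<'HTML'
<tr class="odd">
<td >
<b>Sunday</b> Info
<div class="test">test</div>
</td>
<td>
€ 3.65 *
</td>
</tr>
HTML;

// Remove the line breaks so the tags sit directly next to each other.
$flat = str_replace(["\r", "\n"], '', $data);

$result = null;
$pattern = '~<tr class="odd"><td ><b>Sunday</b> Info<div class="test">test</div></td><td>€ (.*?) \*</td></tr>~i';
if (preg_match($pattern, $flat, $matches) === 1) {
    $result = $matches[1];
}

echo $result, "\n";
```

If the real page indents its tags, putting \s* between the tags in the pattern is more forgiving than stripping newlines.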

Try in this way:
$uri = ('http://www.test.com/');
$get = file_get_contents($uri);
$pos1 = strpos($get, "<tr class=\"odd\"><td ><b>Sunday</b> Info<div class=\"test\">test</div></td><td>€");
$pos2 = strpos($get, "*</td></tr>", $pos1);
$text = substr($get,$pos1,$pos2-$pos1);
$text1 = strip_tags($text);

Using the PHP DOMDocument object, we're going to parse the HTML DOM data from the web page:
$dom = new DOMDocument();
$dom->loadHTML($data);
$trs = $dom->getElementsByTagName('tr'); // this gives us all the tr elements on the webpage
// loop through all the tr tags
foreach($trs as $tr) {
// until we get one with the class 'odd' and has a b tag value of SUNDAY
if ($tr->getAttribute('class') == 'odd' && $tr->getElementsByTagName('b')->item(0)->nodeValue == 'Sunday') {
// now set the price to the node value of the second td tag
$price = trim($tr->getElementsByTagName('td')->item(1)->nodeValue);
break;
}
}
DOMDocument can be a bit tedious for web scraping; as an alternative, you can get your hands on SimpleHtmlDomParser, which is open source.

php regex to extract data from HTML table

I'm trying to make a regex for taking some data out of a table.
The code I've got now is:
<table>
<tr>
<td>quote1</td>
<td>have you trying it off and on again ?</td>
</tr>
<tr>
<td>quote65</td>
<td>You wouldn't steal a helmet of a policeman</td>
</tr>
</table>
I want to turn this into:
quote1:have you trying it off and on again ?
quote65:You wouldn't steal a helmet of a policeman
The regex that I have written so far is this:
%<td>((?s).*?)</td>%
But now I'm stuck.

If you really want to use regexes (might be OK if you are really really sure your string will always be formatted like that), what about something like this, in your case :
$str = <<<A
<table>
<tr>
<td>quote1</td>
<td>have you trying it off and on again ?</td>
</tr>
<tr>
<td>quote65</td>
<td>You wouldn't steal a helmet of a policeman</td>
</tr>
</table>
A;
$matches = array();
preg_match_all('#<tr>\s+?<td>(.*?)</td>\s+?<td>(.*?)</td>\s+?</tr>#', $str, $matches);
var_dump($matches);
A few words about the regex :
<tr>
then any number of spaces
then <td>
then what you want to capture
then </td>
and the same again
and finally, </tr>
And I use :
? in the regex to match in non-greedy mode
preg_match_all to get all the matches
You then get the results you want in $matches[1] and $matches[2] (not $matches[0]); here's the output of the var_dump I used (I've removed entry 0 to make it shorter):
array
0 =>
...
1 =>
array
0 => string 'quote1' (length=6)
1 => string 'quote65' (length=7)
2 =>
array
0 => string 'have you trying it off and on again ?' (length=37)
1 => string 'You wouldn't steal a helmet of a policeman' (length=42)
You then just need to manipulate this array, with some strings concatenation or the like ; for instance, like this :
$num = count($matches[1]);
for ($i=0 ; $i<$num ; $i++) {
echo $matches[1][$i] . ':' . $matches[2][$i] . '<br />';
}
And you get :
quote1:have you trying it off and on again ?
quote65:You wouldn't steal a helmet of a policeman
Note: you should add some sanity checks (e.g. preg_match_all must not return false, the match count must be at least 1, ...)
As a side note : using regex to parse HTML is generally not a really good idea ; if you can use a real parser, it should be way safer...
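One way those checks could look, as a sketch. The function name extract_quotes is hypothetical, and the pattern is a slightly relaxed variant of the one above (\s* instead of \s+?, plus the s modifier) so that rows without line breaks also match.

```php
<?php
/**
 * Hypothetical helper: returns "key:value" lines for each two-cell row,
 * or an empty array when nothing matches.
 */
function extract_quotes(string $html): array
{
    $count = preg_match_all(
        '#<tr>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*</tr>#s',
        $html,
        $matches
    );

    // preg_match_all returns false on a regex error, otherwise the match count.
    if ($count === false) {
        throw new RuntimeException('Regex error code: ' . preg_last_error());
    }

    $lines = [];
    for ($i = 0; $i < $count; $i++) {
        $lines[] = trim($matches[1][$i]) . ':' . trim($matches[2][$i]);
    }
    return $lines;
}

$str = <<<'A'
<table>
<tr>
<td>quote1</td>
<td>have you trying it off and on again ?</td>
</tr>
<tr>
<td>quote65</td>
<td>You wouldn't steal a helmet of a policeman</td>
</tr>
</table>
A;

$lines = extract_quotes($str);
echo implode("\n", $lines), "\n";
```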

Tim's regex probably works, but you may want to consider using the DOM functionality of PHP instead of regex, as it may be more reliable in dealing with minor changes in the markup.
See the loadHTML method

As usual, extracting text from HTML and other non-regular languages should be done with a parser - regexes can cause problems here. But if you're certain of your data's structure, you could use
%<td>((?s).*?)</td>\s*<td>((?s).*?)</td>%
to find the two pieces of text. \1:\2 would then be the replacement.
If the text cannot span more than one line, you'd be safer dropping the (?s) bits...
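For illustration, the same pattern can also be used for matching rather than replacing. The sketch below swaps in preg_match_all with PREG_SET_ORDER, which groups the captures per row, and builds the key:value lines by hand; the variable names are my own.

```php
<?php
$str = <<<'A'
<table>
<tr>
<td>quote1</td>
<td>have you trying it off and on again ?</td>
</tr>
<tr>
<td>quote65</td>
<td>You wouldn't steal a helmet of a policeman</td>
</tr>
</table>
A;

// PREG_SET_ORDER: $rows[n] holds the full match and both groups of row n.
preg_match_all(
    '%<td>((?s).*?)</td>\s*<td>((?s).*?)</td>%',
    $str,
    $rows,
    PREG_SET_ORDER
);

$out = [];
foreach ($rows as $row) {
    $out[] = $row[1] . ':' . $row[2];
}

echo implode("\n", $out), "\n";
```

A plain preg_replace with \1:\2 would leave the surrounding <table> and <tr> tags in the string, which is why the matching variant is shown here.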

Extract the content of each <td>:
preg_match_all("%<td[^>]*>((?s).*?)</td>%", $response, $matches);
var_dump($matches);

Don't use regex, use a HTML parser. Such as the PHP Simple HTML DOM Parser
