regex find specific tables in html

I have HTML like the following, and I'm using PHP:
<table style="...">
<tbody>
<tr> <img id="foo" src="foo"/></tr>
</tbody>
</table>
<p> ....</p>
<table style="...">
<tbody>
<tr> <img id="bar" src="bar"/></tr>
</tbody>
</table>
I'm a PHP beginner.
I want to find one specific table, for example the one whose img src or id equals foo or bar,
but my patterns select both tables.
Here are my regexes:
1. Find tables that have an img tag:
/<table.*?>.*?<img *.*?<\/table>/
-> selects both tables
2. Add the img src:
<table.*?<img.+(src=.*?foo).*?<\/table>
-> selects everything, from the first tag to the last tag
3. So I tried not to include </table> between ... tag:
<table.*?(?!<\/table>).*?<img.+(src=.*?foo).*?<\/table>
-> same result
I don't know what is wrong!
I solved it using preg_match_all(), but I still want to know how to do it with preg_match().
Any ideas?
Thanks!

This job is much better suited to PHP's DOMDocument and DOMXPath classes. In this case we use an XPath expression to search for a table that has a descendant img whose src attribute equals either 'foo' or 'bar':
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$footable = $xpath->query("//table[descendant::img[@src='foo']]");
echo $footable->item(0)->C14N() . "\n";
$bartable = $xpath->query("//table[descendant::img[@src='bar']]");
echo $bartable->item(0)->C14N() . "\n";
Output:
<table style="..."><tbody><tr><img id="foo" src="foo"></img></tr></tbody></table>
<table style="..."><tbody><tr><img id="bar" src="bar"></img></tr></tbody></table>
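As for why the third pattern in the question still overshoots: (?!<\/table>) is a lookahead that is tested at only one position, so the .*? that follows it is still free to run past a </table>. One common workaround is to repeat the lookahead before every single character. The sketch below is only an illustration of that idea, not a replacement for the DOM approach: the sample $html is a tidied copy of the markup in the question, and the pattern assumes double-quoted src values.

```php
<?php
// Tidied copy of the markup from the question (assumption: well-formed tags).
$html = <<<'HTML'
<table style="..."><tbody><tr> <img id="foo" src="foo"/></tr></tbody></table>
<p> ....</p>
<table style="..."><tbody><tr> <img id="bar" src="bar"/></tr></tbody></table>
HTML;

// (?:(?!</table>).)*? consumes one character at a time, and only while the
// text at that position does not start with "</table>", so a match can never
// cross from one table into the next. The s flag lets "." match newlines.
$pattern = '~<table\b(?:(?!</table>).)*?<img\b[^>]*\bsrc="bar"[^>]*>(?:(?!</table>).)*?</table>~s';

$m = [];
if (preg_match($pattern, $html, $m)) {
    echo $m[0], "\n"; // the whole table that contains the img with src="bar"
}
```

Because each table is matched in isolation, a single preg_match() call is enough here; changing "bar" to "foo" in the pattern selects the other table.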

Related

How to get content from a div using regex

I have a string like this:
<div class="fck_detail">
<table align="center" border="0" cellpadding="3" cellspacing="0" class="tplCaption" width="1">
<tbody>
<tr><td>
<img alt="nole-1375196668_500x0.jpg" src="http://l.f1.img.vnexpress.net/2013/07/30/nole-1375196668_500x0.jpg" width="500">
</td></tr>
<tr><td class="Image">
Djokovic hậm hực với các đàn anh. Ảnh: <em>Livetennisguide.</em>
</td></tr>
</tbody>
</table>
<p>Riêng với Andy Murray, ...</p>
<p style="text-align:right;"><strong>Anh Hào</strong></p>
</div>
I want to get the content. How do I write this pattern using preg_match? Please help me.

If there are no other HTML tags inside the div, then this regex should work:
$v = '<div class="fck_detail">Some content here</div>';
$regex = '#<div class="fck_detail">([^<]*)</div>#';
preg_match($regex, $v, $matches);
echo $matches[1];
The actual regex here is <div class="fck_detail">([^<]*)</div>. Regexes used in PHP also need to be surrounded by some other character that doesn't occur in the regex (I used #).
However, if what you're parsing is arbitrary HTML provided by the user, then preg_match simply can't do this. Full-fledged HTML parsing is beyond the ability of any regex, and that's what you'll need if you're parsing the output of a full-fledged HTML editor.
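When the div does contain nested tags, as in the question, a DOM parser can return its inner HTML instead. The following is a minimal sketch under some assumptions: the sample $html is a shortened, ASCII-only stand-in for the markup in the question, and the div is located by an exact match on its class attribute.

```php
<?php
// Shortened stand-in for the question's markup (assumption for illustration).
$html = '<div class="fck_detail"><table><tbody><tr><td>Caption</td></tr></tbody></table>'
      . '<p>First paragraph</p></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@class="fck_detail"]')->item(0);

// Serialize each child node to build the div's inner HTML.
$inner = '';
if ($div !== null) {
    foreach ($div->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}
echo $inner, "\n";
```

For real pages with non-ASCII text, the input encoding usually needs to be declared to loadHTML() as well; that detail is left out of this sketch.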

Extract specific data from SimplePie get_content object

I have an RSS feed from which I'm trying to extract data through SimplePie (in WordPress).
I have to extract the content tag. It works with <?php echo $item->get_content(); ?>. It throws out all this stuff (of course this is just an entry, the others have the same structure):
<table><tr valign="top">
<td width="67">
<a href="http://www.anobii.com/books/Lapproccio_sistemico_al_governo_dellimpresa/9788813230944/014c5c45a7ddaab1ec/" style="border: 1px solid #333333">
<img src="http://image.anobii.com/anobi/image_book.php?type=3&item_id=014c5c45a7ddaab1ec&time=0">
</a>
</td><td style="margin-left: 10px;padding-left: 10px">[person name] put "[title]" onto shelf<br/></td></tr></table>
What I actually need is just the value of the src="" attribute (the image URL). How can I extract only that?

You can do it using DOMDocument (the best way):
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about malformed HTML
$imgs = $doc->getElementsByTagName('img');
$res = $imgs->item(0)->getAttribute('src');
print_r($res);
With a regex (the bad way):
if (preg_match('~\bsrc\s*=\s*["\']\K[^"\']*+~i', $html, $match))
print_r($match);
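If an entry may contain more than one image, the same DOM approach extends to collecting every src value with an XPath query on the attribute itself. This is a sketch only; the sample $html and the file names in it are made up for illustration.

```php
<?php
// Hypothetical entry with two images (assumption for illustration).
$html = '<table><tr><td><a href="#"><img src="a.png"></a></td>'
      . '<td><img src="b.png"></td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// //img/@src selects the src attribute node of every img element.
$srcs = [];
foreach ($xpath->query('//img/@src') as $attr) {
    $srcs[] = $attr->nodeValue;
}
print_r($srcs);
```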

If I want to inspect a table row with xpath

As part of a cURL operation I have some parsing to do. The data I want resides at ../table/tr/td, where there are multiple td cells containing many strings, one of which is <b>34 PT</b>. The number is random, however, and I cannot figure out how to do a simple 'wildcard' match or similar.
The suggestions I've found:
/tr[contains(@td, 'PT')]" );
does not return any results, nor does:
/tr/td[contains( @b, 'PT' ) ]
I've removed any kind of search at the end and it returns all of the cells as expected, so I know the data is there. The table cells that contain PT have an <a href> that I need to know.
Here is an example of the entire html:
<table>
<tr>
<td>
<tr>
<td width="120" valign="top" align="center">
<a href="submit.phtml?PT_id=86343434&xcn=b22c57866bfc2bac89b09527b05b7760&location_id=0">
<img height="80" width="80" border="1" alt="" src=".gif">
</a>
<b>3423 PT</b>
<td>
<td>
<tr>
<td> ...and so on
The xpath query was used like this:
@$dom = new DOMDocument();
@$dom->loadHTML( $rawPage );
@$xpath = new DOMXPath( $dom );
@$queryResult = $xpath->query( " //html/body/div[3]/div[3]/table/tr/td[2]/table[2]/tr/td/div/div/table/tr[2]/td/table/tr/td[contains( b, 'PT' ) ]" );

Remove the @ symbol so that the expression inspects the child element's value rather than an attribute,
i.e. /tr/td[contains( b, 'PT' ) ]
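Since the goal in the question is the href of the link in each matching cell, the predicate can be followed by a step down to the a element's attribute. The sketch below assumes a simplified table; the markup and href values are invented for illustration, and the query uses //td rather than the long absolute path from the question.

```php
<?php
// Simplified stand-in for the page (assumption for illustration).
$html = '<table><tr>'
      . '<td><a href="submit.phtml?PT_id=1"><img src="x.gif"></a><b>3423 PT</b></td>'
      . '<td><a href="other.phtml"><img src="y.gif"></a><b>n/a</b></td>'
      . '</tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// contains(b, 'PT') tests the text of the first <b> child element;
// @b would instead look for an attribute named "b", which does not exist.
$hrefs = [];
foreach ($xpath->query("//td[contains(b, 'PT')]/a/@href") as $href) {
    $hrefs[] = $href->nodeValue;
}
print_r($hrefs);
```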

How to keep <p><img ... /></p> with XPATH?

I use XPATH to remove untidy HTML tags,
$nodeList = $xpath->query("//*[normalize-space(.)='' and not(self::br)]");
foreach($nodeList as $node)
{
$node->parentNode->removeChild($node);
}
will remove the horrible input like these,
<p><em><br /></em></p>
<p><span style="text-decoration: underline;"><em><br /></em></span></p>
but it also removes an img tag like the one below, which I want to keep,
<p><img title="picture summit" src="images/32913430_127001_e.jpg" alt="picture summit" width="590" height="366" /></p>
How can I keep the img tag with XPath?

Use:
//p[not(descendant::*[self::img or self::br]) and normalize-space()='']

Maybe you could use an XPath 1.0 expression like the one below to remove unwanted paragraphs:
//p[count(text())=0 and count(img)=0]
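Another option is to keep the general query from the question and simply exclude img elements and anything that contains one. The sketch below shows that variation on a small made-up fragment; the extra predicates are an assumption about the desired behaviour, not something taken from the answers above.

```php
<?php
// Small made-up fragment: one junk paragraph, one image paragraph, one text paragraph.
$html = '<div><p><em><br /></em></p>'
      . '<p><img src="images/a.jpg" alt="a" /></p>'
      . '<p>Text</p></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Empty elements, except br, img, and any element that wraps an img.
$query = "//*[normalize-space(.)='' and not(self::br)"
       . " and not(self::img) and not(descendant::img)]";

foreach ($xpath->query($query) as $node) {
    if ($node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
}

$div = $doc->getElementsByTagName('div')->item(0);
$result = $doc->saveHTML($div);
echo $result, "\n";
```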

PHP allow img tags only

I need your assistance with PHP. I want to allow only HTML <img> tags. I tried PHP's built-in function strip_tags(), but it doesn't give me the output I need. For instance, in the following code strip_tags() keeps the img tag, but it also keeps the text.
$img = "<img src='/img/fawaz.jpg' alt= ''> <br /> <p> This is a detailed paragraph about Fawaz and his mates.</p>";
echo strip_tags($img , "<img>");
What would be the proper way to keep only <img> (or any other single tag) from a function or variable?
Any help would be appreciated.
Thanks

This might be due to the unclosed img tag in your code. Try this:
$img = "<img src='/img/fawaz.jpg' alt= '' /> <br /> <p> This is a detailed paragraph about Fawaz and his mates.</p>";
echo strip_tags($img , "<img>");

strip_tags() doesn't behave the way you want it to. If you supply a second argument, the tags listed there are kept in the resulting string and all other tags are stripped. It never filters out the text between tags.
If you want to extract <img/> elements only, don't even think about using a regex. Use a DOM parser for that:
libxml_use_internal_errors(true);
$doc=new DOMDocument;
$html=$doc->loadHTML('<img src="/img/fawaz.jpg" alt= ""> <br /> <p> This is a
detailed paragraph about Fawaz and his mates.</p>');
$path=new DOMXPath($doc);
foreach ($path->query('//img') as $found)
var_dump($doc->saveXML($found));
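To see the strip_tags() behaviour described in this answer, the short sketch below runs it on a string modelled on the one in the question (the paragraph text is shortened for illustration). The <br> and <p> tags disappear, while the <img> tag and the paragraph's text both survive.

```php
<?php
// String modelled on the question's example (text shortened for illustration).
$html = "<img src='/img/fawaz.jpg' alt=''> <br /> <p>Some text.</p>";

// Only <img> is allowed; other tags are removed, but their inner text stays.
$kept = strip_tags($html, '<img>');

var_dump($kept);
```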

To delete all HTML tags except <img>, <a>, <br/> and <hr/>:
$img = "
<img src='/img/fawaz.jpg' alt= '' />
<br /><br/>
<hr/>
<p> This is a detailed paragraph about Fawaz and his mates.</p>
<a href='cft'>123</a>
";
$img = strip_tags($img, "<img><a><br><hr>"); // allowed tags are listed back to back, without separators
echo $img;
