HTML Regex to Extract Data - php

I have a simple question for regex gurus. And yes... I did try several different variations of the regex before posting here. Forgive my regex ignorance. This is targeting PHP.
I have the following HTML:
<div>
<h4>
some text blah
</h4>
I need this text<br />I need this text too.<br />
</div>
<div>
<h4>
some text blah
</h4>
I need this text<br />I need this text too.<br />
</div>
<div>
<h4>
some text blah
</h4>
I need this text<br />I need this text too.<br />
</div>
What I tried that seemed most likely to work:
preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>(.*)<br \/>/', $haystack, $result);
The above returns nothing.
So then I tried this and I got the first group to match, but I have not been able to get the second.
preg_match_all('/<div><h4><a href=".*">.*<\/a><\/h4>(.*)<br \/>/', $haystack, $result);
Thank you!

Regex is great. But, some things are best tackled with a parser. Markup is one such example.
Instead of using regex, I'd use an HTML parser, like http://simplehtmldom.sourceforge.net/
However, if you insist on using regex for this specific case, you can use this pattern:
if (preg_match('%</h4>(\\r?\\n)\\s*(.*?)(<br />)(.*?)(<br />)%', $subject, $regs)) {
    // \s* allows the text to start with or without leading indentation
    $first_text_string = $regs[2];
    $second_text_string = $regs[4];
} else {
    // pattern not found
}

I highly recommend using DOM and XPath for this.
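$doc = new DOMDocument;
@$doc->loadHTML($html); // @ suppresses warnings about imperfect markup
$xp = new DOMXPath($doc);
// Each <br /> splits the div's text into separate text nodes,
// so skip the whitespace-only ones and print the rest.
foreach ($xp->query('//div/text()[normalize-space()]') as $n) {
    echo trim($n->nodeValue) . "\n";
}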
But if you still decide to take the regex route, this will work for you.
preg_match_all('#</h4>\s*([^<]+)<br />([^<]+)#', $str, $matches);
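As a rough sketch of how the captures from that pattern could be read back, assuming the markup from the question sits in a string (the variable $str below holds two of the three divs, and the $pairs array is a name made up for illustration), PREG_SET_ORDER groups the results per div:

```php
<?php
// Sample input copied from the question (two of the three divs).
$str = <<<HTML
<div>
<h4>
some text blah
</h4>
I need this text<br />I need this text too.<br />
</div>
<div>
<h4>
some text blah
</h4>
I need this text<br />I need this text too.<br />
</div>
HTML;

// PREG_SET_ORDER makes each element of $matches one complete match:
// [0] => whole match, [1] => first text, [2] => second text.
preg_match_all('#</h4>\s*([^<]+)<br />([^<]+)#', $str, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    $pairs[] = [trim($m[1]), trim($m[2])];
}

print_r($pairs);
```

Without PREG_SET_ORDER the same data is available as $matches[1] and $matches[2], grouped by capture group rather than by div.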

This will do what you want given the exact input you provided. If you need something more generic please let me know.
(.*)<br\s*\/>(.*)<br\s*\/>
See here for a live demo http://www.phpliveregex.com/p/1i3

Related

Get text between 2 tags that change (regex)(php)

How can I get the text between two HTML tags that are not always the same? How can I make the regex "ignore" part of the tag?
Let's say this is my HTML:
<html>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl03_lblName">string 1</span>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl04_lblName">string 2</span>
...
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl53_lblName">string 3</span>
...
</html>
As you can see, the ctlxx part is not always the same, and this code only gets the first string:
preg_match('#\\<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl03_lblName">(.+)\\</span>#s',$html,$matches);
$match = $matches[0];
echo $match;
How can I let regex ignore the ctlxx part and echo all the strings?
Thanks in advance
You can do it with DOMDocument and DOMXPath, calling preg_match from inside the XPath expression:
$dom = new DOMDocument();
$dom->loadHTML($str);
$x = new DOMXpath($dom);
// The next two lines make PHP functions callable inside XPath expressions
$x->registerNamespace("php", "http://php.net/xpath");
$x->registerPHPFunctions();
// Select span tags whose id matches the pattern; "> 0" turns the
// preg_match result into a boolean instead of a position test
foreach ($x->query('//span[php:functionString("preg_match", "/ctl00_ContentPlaceHolder1_gvDomain_ctl\d+_lblName/", @id) > 0]') as $node) {
    echo $node->nodeValue;
}
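If you want to solve it using a regular expression, you can do something like this:
<?php
// \d+ stands in for the changing ctlXX part; .+? stops at the first </span>
preg_match_all('/<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl\d+_lblName">(.+?)<\/span>/is', $html, $matches);
foreach ($matches[1] as $match) {
    echo $match, "\n";
}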
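For comparison, here is a minimal sketch that selects the same spans with plain XPath string functions, so no PHP callbacks need to be registered. The sample markup is an assumption modelled on the question (the span with id "other" is invented to show that non-matching ids are skipped):

```php
<?php
$html = <<<HTML
<html><body>
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl03_lblName">string 1</span>
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl04_lblName">string 2</span>
<span id="other">not wanted</span>
<span id="ctl00_ContentPlaceHolder1_gvDomain_ctl53_lblName">string 3</span>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// XPath 1.0 has no ends-with(), so the suffix is compared via substring().
// "_lblName" is 8 characters long, hence the offset of 7.
$query = '//span[starts-with(@id, "ctl00_ContentPlaceHolder1_gvDomain_ctl")'
       . ' and substring(@id, string-length(@id) - 7) = "_lblName"]';

$strings = [];
foreach ($xpath->query($query) as $node) {
    $strings[] = $node->nodeValue;
}

print_r($strings);
```

This checks only the fixed prefix and suffix of the id, so it is slightly looser than the \d+ pattern: anything may appear in place of the digits.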

Regex match full hyperlink only with certain class

I have a string that contains some hyperlinks. I want to match only a certain link with a regex. I can't know whether the href or the class comes first; the order may vary.
This is an example string:
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
I want to select from the above string only the link that has the class nextpostslink.
So, the match in this example should return this -
»eee
This regex is the closest I could get -
/<a\s?(href=)?('|")(.*)('|") class=('|")nextpostslink('|")>.{1,6}<\/a>/
But it is selecting the links from the start of the string.
I think my problem is in the (.*) , but I can't figure out how to change this to select only the needed link.
I would appreciate your help.
It's much better to use a genuine HTML parser for this. Abandon all attempts to use regular expressions on HTML.
Use PHP's DOMDocument instead:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
foreach ($dom->getElementsByTagName('a') as $link) {
$classes = explode(' ', $link->getAttribute('class'));
if (in_array('nextpostslink', $classes)) {
// $link has the class "nextpostslink"
}
}
Not sure if that's what you're after, but anyway: it's a bad idea to parse HTML with regex. Use an XPath implementation to reach the desired elements. The following XPath expression would give you all the 'a' elements with class "nextpostslink":
//a[contains(@class,"nextpostslink")]
There is plenty of XPath information around. Since you didn't mention your programming language, here is a quick XPath tutorial using Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html
Edit:
php + xpath + html: http://dev.juokaz.com/php/web-scraping-with-php-and-xpath
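A minimal sketch of how such an XPath query might look in PHP, under the assumption that the target link carries the class nextpostslink. The markup and the example.com URLs below are made up for illustration, since the original sample does not show that link's attributes:

```php
<?php
// Hypothetical markup: the hrefs below are invented for illustration.
$html = <<<HTML
<div class="wp-pagenavi">
<a href="http://example.com/page/2" class="page">2</a>
<a class="nextpostslink extra" href="http://example.com/page/3">next</a>
<a href="http://example.com/page/8" class="last">last</a>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Padding the class list with spaces matches "nextpostslink" only as a whole
// class name, so something like "nextpostslink-old" would not match.
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " nextpostslink ")]';

$links = [];
foreach ($xpath->query($query) as $a) {
    $links[] = ['href' => $a->getAttribute('href'), 'text' => $a->nodeValue];
}

print_r($links);
```

Because the attributes are read from the parsed element, it makes no difference whether class or href comes first in the source.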
This would work in php:
/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m
This is of course assuming that the class attribute always comes after the href attribute.
This is a code snippet:
$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;
$regexp = "/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m";
$matches = array();
if(preg_match($regexp, $html, $matches)) {
echo "URL: " . $matches[2] . "\n";
echo "Text: " . $matches[6] . "\n";
}
I would however suggest first matching the link and then getting the url so that the order of the attributes doesn't matter:
<?php
$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;
$regexp = "/(<a[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>)/m";
$matches = array();
if(preg_match($regexp, $html, $matches)) {
$link = $matches[0];
$text = $matches[4];
$regexp = "/href=(\"|')([^'\"]*)(\"|')/";
$matches = array();
if(preg_match($regexp, $link, $matches)) {
$url = $matches[2];
echo "URL: $url\n";
echo "Text: $text\n";
}
}
You could of course extend the regexp to match either variant (class first vs. href first), but it would be very long and I don't think it would improve performance.
Just as a proof of concept I created a regexp that doesn't care about the order:
/<a[^>]+(href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')|class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')[^>]+href=(\"|')([^\"']*)('|\"))[^>]*>(.{1,6})<\/a>/m
The text will be in group 12 and the URL will be in either group 3 or group 10 depending on the order.
As the question asks for a regex, here is one: <a\s[^>]*class=["|']nextpostslink["|'][^>]*>(.*)<\/a>
The order of the attributes doesn't matter, and it also handles single or double quotes.
Check the regex online: https://regex101.com/r/DX03KD/1/
I replaced the (.*) with [^'"]+ as follows:
<a\s*(href=)?('|")[^'"]+('|") class=('|")nextpostslink('|")>.{1,6}</a>
Note: I tried this with RegexBuddy, so I didn't need to escape the <>'s or /

Take a part of a big text

Let's say we have a string ($text)
I will help you out, if <b>you see this message and never forget</b> blah blah blah
I want to take text from "<b>" to "</b>" into a new string($text2)
How can this be done?
I appreciate any help I can get. Thanks!
Edit:
I want to extract code like this:
<embed type="application/x-shockwave-flash"></embed>
If you only want the first match and do not need to match something like <b class=">, the following will work:
UPDATED for comment:
$text = "I will help you out, if <b>you see this message and never forget</b> blah blah blah";
$matches = array();
preg_match('#<b>.*?</b>#s', $text, $matches);
if ($matches) {
$text2 = $matches[0];
// Do something with $text2
}
else {
// The string wasn't found, so do something else.
}
But for something more complex, you really should parse it as DOM per Marc B.'s comment.
Use this bad mofo: http://fr2.php.net/domdocument
$dom = new DOMDocument();
$dom->loadHTML($text);
$xpath = new DOMXpath($dom);
$nodes = $xpath->query('//b');
Here you can either loop through each one, or if you know there is only one, just grab the value.
$text1 = $nodes->item(0)->nodeValue;
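A short sketch of the looping variant, assuming the sample string from the question. It collects both the text inside each tag and the tag together with its content; the array names $inner and $outer are made up for illustration:

```php
<?php
$text = 'I will help you out, if <b>you see this message and never forget</b> blah blah blah';

$dom = new DOMDocument();
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);

$inner = [];  // text between the tags
$outer = [];  // the tags together with their content
foreach ($xpath->query('//b') as $node) {
    $inner[] = $node->nodeValue;
    // saveHTML() with a node argument serializes just that node.
    $outer[] = $dom->saveHTML($node);
}

print_r($inner);
print_r($outer);
```

Changing the query to '//embed' should work the same way for the embed element mentioned in the edit, although that case is not exercised here.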
strip_tags($text, '<b>');
removes every tag except <b> and </b> while keeping all of the surrounding text.
So it only helps if the goal is to strip other markup, not to isolate the text between <b> and </b>.

regex php: find everything in div

I'm trying to find everything inside a div using regexp. I'm aware that there is probably a smarter way to do this, but I've chosen regexp.
So currently my regexp pattern looks like this:
$gallery_pattern = '/<div class="gallery">([\s\S]*)<\/div>/';
And it does the trick - somewhat.
The problem is if I have two divs after each other, like this:
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
I want to extract the information from both divs, but my problem, when testing, is that I'm not getting the text of each div as a result, but instead:
"text to extract here </div>
<div class="gallery">text to extract from here as well"
So to sum up: it skips the first closing div tag and continues on to the next.
The text inside the div can contain <, / and line breaks, just so you know!
Does anyone have a simple solution to this problem? I'm still a regexp novice.
You shouldn't be using regex to parse HTML when there's a convenient DOM library:
$str = '
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
';
$doc = new DOMDocument();
$doc->loadHTML($str);
$divs = $doc->getElementsByTagName('div');
if ( $divs->length ) {
foreach ( $divs as $div ) {
echo $div->nodeValue . '<br>';
}
}
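Since the question says the div content may itself contain markup, here is a hedged sketch that restricts the query to the gallery class and rebuilds each div's inner HTML instead of its plain text. The sample string is an assumption (the em element and the "other" div are invented to show the filtering):

```php
<?php
$str = '
<div class="gallery">text with <em>markup</em> here</div>
<div class="other">not a gallery</div>
<div class="gallery">text to extract from here as well</div>
';

$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

$contents = [];
foreach ($xpath->query('//div[@class="gallery"]') as $div) {
    // Concatenate the serialized children to get the div's inner HTML.
    $innerHtml = '';
    foreach ($div->childNodes as $child) {
        $innerHtml .= $doc->saveHTML($child);
    }
    $contents[] = $innerHtml;
}

print_r($contents);
```

The @class="gallery" test matches only when the attribute is exactly "gallery"; a div with several classes would need a contains()-style test instead.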
What about something like this :
$str = <<<HTML
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
HTML;
$matches = array();
preg_match_all('#<div[^>]*>(.*?)</div>#s', $str, $matches);
var_dump($matches[1]);
Note the '?' in the regex, so it is "not greedy".
Which will get you :
array
0 => string 'text to extract here' (length=20)
1 => string 'text to extract from here as well' (length=33)
This should work fine, as long as you don't have nested divs. If you do... well, are you really sure you want to use regular expressions to parse HTML, which is not that regular itself?
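To illustrate the nested case, here is a small sketch with an invented input in which one div sits inside another. The non-greedy pattern ends at the first closing tag it meets, which belongs to the inner div:

```php
<?php
$str = '<div class="gallery">outer start <div>inner</div> outer end</div>';

preg_match_all('#<div[^>]*>(.*?)</div>#s', $str, $matches);

// The lazy match stops at the first </div>, which belongs to the inner div,
// so the capture is cut short and " outer end" is lost.
var_dump($matches[1]);
```

The single captured string is 'outer start <div>inner', which is why a DOM parser is the safer choice once divs can be nested.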
A possible answer to this problem can be found at http://simplehtmldom.sourceforge.net/
That class helped me solve a similar problem quickly.

How to grab the contents of HTML tags?

Hey so what I want to do is snag the content for the first paragraph. The string $blog_post contains a lot of paragraphs in the following format:
<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>
The problem I'm running into is that I am writing a regex to grab everything between the first <p> tag and the first closing </p> tag. However, it is grabbing the first <p> tag and the last closing </p> tag which results in me grabbing everything.
Here is my current code:
if (preg_match("/[\\s]*<p>[\\s]*(?<firstparagraph>[\\s\\S]+)[\\s]*<\\/p>[\\s\\S]*/",$blog_post,$blog_paragraph))
echo "<p>" . $blog_paragraph["firstparagraph"] . "</p>";
else
echo $blog_post;
Well, sysrqb will let you match anything in the first paragraph assuming there's no other html in the paragraph. You might want something more like this
<p>.*?</p>
Placing the ? after your * makes it non-greedy, meaning it will only match as little text as necessary before matching the </p>.
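A minimal side-by-side sketch of the two behaviours, using the sample string from the question (the variable names $greedy and $lazy are made up for illustration):

```php
<?php
$blog_post = '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>';

// Greedy: .* runs on to the last </p> in the string.
preg_match('#<p>(.*)</p>#s', $blog_post, $greedy);

// Lazy: .*? stops at the first </p> it can reach.
preg_match('#<p>(.*?)</p>#s', $blog_post, $lazy);

echo $greedy[1], "\n";
echo $lazy[1], "\n";
```

The greedy capture spans all three paragraphs, while the lazy one contains only 'Paragraph 1'.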
If you use preg_match, use the "U" flag to make it un-greedy.
preg_match("/<p>(.*)<\/p>/U", $blog_post, $matches);
$matches[1] will then contain the first paragraph.
It would probably be easier and faster to use strpos() to find the position of the first <p> and the first </p>, then use substr() to extract the paragraph.
$paragraph_start = strpos($blog_post, '<p>');
$paragraph_end = strpos($blog_post, '</p>', $paragraph_start);
$paragraph = substr($blog_post, $paragraph_start + strlen('<p>'), $paragraph_end - $paragraph_start - strlen('<p>'));
Edit: Actually the regex in others' answers will be easier and faster... your big complex regex in the question confused me...
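One thing the strpos() version leaves open is what happens when a tag is missing, since strpos() returns false in that case. A sketch of a guarded variant follows; the helper name first_paragraph is hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper: returns the first paragraph's contents, or null
// when no complete <p>...</p> pair is found.
function first_paragraph($html)
{
    $open = '<p>';
    $start = strpos($html, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);
    $end = strpos($html, '</p>', $start);
    if ($end === false) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

$blog_post = '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>';
var_dump(first_paragraph($blog_post));
var_dump(first_paragraph('no paragraphs here'));
```

The strict === false comparison matters because a <p> at position 0 would otherwise be mistaken for "not found".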
Using regular expressions for HTML parsing is rarely the right solution. XPath is a better fit for this particular case:
$string = <<<XML
<div>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
</div>
XML;
$xml = new SimpleXMLElement($string);
/* Find the first <p> element in the document */
$resultado = $xml->xpath('(//p)[1]');
echo $resultado[0];
