Regular expression missed first occurrence of target string - php

I am using a regular expression to fetch both text1 and text2 from the following HTML code. Here is what I am using:
/<div\s?class="right-col">[\s\n\S]*<a[\s\n]?[^>]*>#(.*)<\/a>/
but it apparently misses text1 and only captures text2 (here is the link to my problem).
<div class="right-col">
<h1>
title1
</h1>
<p>some text here</p>
<div class="some-class">
<div class="left">
<span>some text here </span>
</div>
<div class="postmeta"><a href="url-link-here" >#text1</a> </div>
</div>
<div class="right-col">
<h1>
title2
</h1>
<p>some text here</p>
<div class="some-class">
<div class="left">
<span>some text here </span>
</div>
<div class="postmeta"><a href="url-link-here" >#text2</a> </div>
</div>
Can you guys tell me what went wrong in my regular expression? Is there a better way to capture both title1, title2 and text1, text2?

Using a regular expression here is not the best way to do it; parsing HTML with regexes is bad practice. You should use a DOM/XML parser instead.
I like using PHP's DOMDocument class. Using XPath, we can quickly find the elements you want:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$aTags = $xPath->query('//div[@class="some-class"]//a[starts-with(text(), "#")]');
foreach ($aTags as $a) {
    echo $a->nodeValue;
}
DEMO: http://codepad.viper-7.com/QHOXzH

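This is a fairly common issue with regular expressions: quantifiers are greedy by default. [\s\S]* (the \n is redundant, since \s already covers it) first consumes everything up to the end of the input and then backtracks only as far as it must, so it stops at the last <a ...> that lets the rest of the pattern match, and the whole input becomes a single match ending in text2. Adding a ? makes the quantifier lazy, so it stops at the first <a ...> instead; with that change your link returns both text1 and text2.
The short answer is to replace [\s\n\S]* with [\s\S]*?, but as others have mentioned, this is probably not a good use of regular expressions.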
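The greedy/lazy difference can be checked with a minimal sketch like the one below. It assumes a trimmed-down copy of the question's markup (with the divs closed) held in a variable named $html; the variable names are illustrative, and the sketch only demonstrates the quantifier change, not a recommended way to parse HTML.

```php
<?php
// Trimmed-down copy of the markup from the question (assumed sample data).
$html = <<<'HTML'
<div class="right-col">
<h1>
title1
</h1>
<div class="postmeta"><a href="url-link-here" >#text1</a> </div>
</div>
<div class="right-col">
<h1>
title2
</h1>
<div class="postmeta"><a href="url-link-here" >#text2</a> </div>
</div>
HTML;

// Greedy [\s\S]*: one match that runs from the first div to the last link.
preg_match_all('/<div\s?class="right-col">[\s\S]*<a[\s\n]?[^>]*>#(.*)<\/a>/', $html, $greedy);

// Lazy [\s\S]*?: each match stops at the nearest link, giving one match per div.
preg_match_all('/<div\s?class="right-col">[\s\S]*?<a[\s\n]?[^>]*>#(.*)<\/a>/', $html, $lazy);

print_r($greedy[1]); // captures from the greedy pattern
print_r($lazy[1]);   // captures from the lazy pattern
```

Comparing the two printed arrays shows that only the quantifier decides whether text1 is captured.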

Related

How to match character inside inline style?

I want to match one letter or number or symbol inside inline style.
Example:
<html>
<head>
</head>
<body>
<p style="color: #48ad64;font-weight:10px;">hi there</p>
<div style="background-color: #48ad64;">
<h3>perfect</h3>
</div>
</body>
</html>
I want to match any c or o or # or 4 or ; or -
If we take o for example, it's supposed to match 5 occurrences.
I want to replace every occurrence within a style declaration using preg_replace().
How can I get this? I tried so many different expressions, but none of them did what I want.
Some of what I tried:
/(?:\G(?!^)|\bstyle=")(?:.{0,}?)(o)(?=[^>]*>)/
/(style=")(?:\w+)(o)(([^"]*)")/
I just need the regex to match all o in my HTML. I expect this:
<html>
<head>
</head>
<body>
<p style="c'o'lor: #48ad64;f'o'nt-weight:10px;">how blabla</p>
<div style="backgr'o'und-c'o'l'o'r: #48ad64;">
<h3>perfect normal o moral bla bal</h3>
</div>
</body>
</html>
I just want all o occurrences inside inline-style above to be replaced with 'o'
A quick/dirty/simple solution is to use preg_replace_callback() with str_replace().
Pattern: /<[^<]+ style="\K.*?(?=">)/
Code:
$html='<html>
<head>
</head>
<body>
<p style="color: #48ad64;font-weight:10px;">hi there</p>
<div style="background-color: #48ad64;">
<h3>perfect</h3>
</div>
</body>
</html>';
$needle="o";
echo preg_replace_callback(
    '/<[^<]+ style="\K.*?(?=">)/', // append the i flag after the closing / for case-insensitive matching
    function ($m) use ($needle) {
        // use str_ireplace() here instead for case-insensitive replacing
        return str_replace($needle, "<b>$needle</b>", $m[0]);
    },
    $html
);
Output:
<html>
<head>
</head>
<body>
<p style="c<b>o</b>l<b>o</b>r: #48ad64;f<b>o</b>nt-weight:10px;">hi there</p>
<div style="backgr<b>o</b>und-c<b>o</b>l<b>o</b>r: #48ad64;">
<h3>perfect</h3>
</div>
</body>
</html>
This is a pure regex replacement method/pattern:
$needle = "o";
// \Q...\E makes the needle value literal
// [^"]*? assumes there is no escaped " inside the style value
// \K restarts the full-string match, so only the needle itself is replaced
echo preg_replace('/(?:\G(?!^)|\bstyle=")[^"]*?\K\Q'.$needle.'\E/', "'$needle'", $html);
The [^"]*? component eliminates the need for a lookahead. However, if a font family name (or similar) were to use \" (escaped double quotes) then replacement accuracy would be negatively impacted.
I wouldn't call either of these methods "robust" because certain substrings of text may trick the pattern into "over-matching" illegitimate style substrings.
To do this properly, I suggest that you use DOMDocument or some other HTML parser to ensure you are only modifying real/true style attributes.
DOMDocument code:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); // flags: no implied html/body wrapper, no default DOCTYPE
$xp = new DOMXPath($dom);
foreach ($xp->query('//*[@style]') as $node) {
    $node->setAttribute('style', str_replace($needle, "'$needle'", $node->getAttribute('style'))); // no regex
}
echo $dom->saveHTML();

RegEx & PHP: last appearance with $ , only working in online tester

So I got the following HTML Code:
[et_pb_section admin_label="section"]
<h2>abcdefghijkl - abcdefghijklmno</h2>
<p> </p>
<p>abcdefghi<strong> abcdefghijkl</strong> abcdefghijklmnopqrstuvxyz.</p>
<p><br /><br /><span style="text-decoration: underline;"><strong>abcdefghij</strong></span></p>
<p><br />abcdefghijklmnopqrstuvxyz.<br /> <br /><br />
<span style="text-decoration: underline;"><strong>abcdefghi</strong>
</span></p>
<p><br />abcdefghijklmano</p>
<br /><br />
<div id="termine">
<div>
<strong>abcdefghikla - 12345678</strong>
<p>
<a href="/just-another-url.html">abcdefghijklmnopqrst <br />
abcdefghijklmnopqrst >>></a>
</p>
</div>
</div>
</div>
and I want to replace the last closing </div> with another piece of code [/et_pb_section].
So I tried one of the many online regex testers and came up with this:
$content = preg_replace("/<\/div>$/", $et_pb_section_ENDTAG, $content);
where $et_pb_section_ENDTAG is
$et_pb_section_ENDTAG = '[/et_pb_section]';
In the online tester everything works fine and the last </div> gets replaced, but inside my PHP script it does not work: nothing happens, no error, nothing. The HTML code stays the same. What am I doing wrong here?
Thank you.
EDIT: Oh, I almost forgot: when I get rid of the $ and use the RegEx option D (matches only at the end of the string), all three closing </div> tags get replaced. So I guess something is wrong with the $.
An example with DOMDocument, DOMXPath and DOMDocumentFragment that replaces the first div tag with an "admin_label" attribute with the value "section" (feel free to adapt to your real needs):
$html = <<<'EOD'
<div admin_label="section">
<p>abdefghij</p>
<p>klmnopqrs</p>
<p>tuvwxyz01</p>
<ul>
<li>2345</li>
<li>6789</li>
</ul>
</div>
EOD;
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$divNode = $xp->query('(//div[@admin_label="section"])[1]')->item(0);
$open = $dom->createTextNode('[et_pb_section admin_label="section"]');
$close = $dom->createTextNode('[/et_pb_section]');
$fragment = $dom->createDocumentFragment();
$fragment->appendChild($open);
foreach ($divNode->childNodes as $childNode) {
    $fragment->appendChild($childNode->cloneNode(true));
}
$fragment->appendChild($close);
$divNode->parentNode->replaceChild($fragment, $divNode);
echo $dom->saveHTML();
libxml_clear_errors();
Note: in real life, you need to check that the XPath query actually returns something before continuing.
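If a regex is still preferred for this particular replacement, one possible explanation for the original symptom is trailing whitespace after the last </div>: without modifiers, $ matches only at the very end of the string or just before a final newline. That cause is an assumption, since the thread does not confirm it, and the sample string and variable names below are illustrative. A minimal sketch:

```php
<?php
// Assumed sample: the real content may end with spaces and a newline after the last </div>.
$content = "<div id=\"termine\">\n<div>x</div>\n</div>\n</div>  \n";
$endTag  = '[/et_pb_section]';

// Original pattern: fails here because two spaces sit between </div> and the final newline.
$strict = preg_replace('/<\/div>$/', $endTag, $content, -1, $strictCount);

// Tolerant pattern: allow any trailing whitespace, then require the absolute end (\z).
$lenient = preg_replace('/<\/div>\s*\z/', $endTag, $content, -1, $lenientCount);

echo $strictCount, "\n";  // number of replacements made by the original pattern
echo $lenientCount, "\n"; // number of replacements made by the tolerant pattern
echo $lenient, "\n";
```

Dumping the tail of the real string with var_dump(substr($content, -20)) would show whether this assumption holds.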

Select p tag after h2 that has a child with id

How can I select a p-tag that is after a tag that has a specific child? Using a web crawler.
http://symfony.com/doc/current/components/css_selector.html
$crawler->filter('h2 span#hello + p')->each(function ($node) {
var_dump($node->html());
});
Example:
<h2><span id="hello">Hi</span></h2>
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
<a>Not me!</a>
<h2>lol</h2>
<p>yo, not me</p>
The selector above does not work.
It is usually best to traverse HTML using the DOMDocument class (http://php.net/manual/en/class.domdocument.php) but you could do it with a regular expression thus:
// put the example HTML code into a string
$html = <<< EOF
<h2><span id="hello">Hi</span></h2>
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
<a>Not me!</a>
<h2>lol</h2>
<p>yo, not me</p>
EOF;
// set up a regular expression
$re = "/<h2[^>]*>.*?<span[^>]*id=\"hello\"[^>]*>.*?<\\/h2[^>]*>.*?(<p.*?)<[^\\/p]/sim";
// get the match ... the (<p.*?) group in the above regex
preg_match($re,$html,$matches);
print $matches[1];
Would output:
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
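For comparison, the DOMDocument route mentioned above could look roughly like the sketch below. It is not the Symfony CssSelector API from the question; it uses plain DOMXPath, and the XPath expression is one possible reading of the requirement: select every p whose nearest preceding h2 sibling contains span#hello.

```php
<?php
$html = <<<'HTML'
<h2><span id="hello">Hi</span></h2>
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
<a>Not me!</a>
<h2>lol</h2>
<p>yo, not me</p>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html); // libxml wraps the fragment in html/body

$xpath = new DOMXPath($dom);

// preceding-sibling::h2[1] is the closest h2 before each p;
// keep the p only if that h2 has a span child with id="hello".
$query = '//p[preceding-sibling::h2[1][span[@id="hello"]]]';

$found = [];
foreach ($xpath->query($query) as $p) {
    $found[] = $p->textContent;
}
print_r($found);
```

The paragraph after the second h2 is skipped because its nearest preceding h2 has no span#hello child.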

xpath not returning text if p tag is followed by any other tag

I want to get all the text inside the <p> and <h3> tags in the following HTML:
<div class="bodyText">
<p>
<div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
<div class="one">
<img src="url" alt="bar" class="img" width="80" height="60" />
</div>
<div class="two">
<h4 class="preTitle">QIEZ-Lieblinge</h4>
<h3 class="title"><a href="url" title="ABC" onclick="cmsTracking.trackClickOut({element:this, channel : 32333770, channelname : 'top_listen', content : 14832081, callTemplate : '_htmltagging.Text', action : 'click', mouseevent : event});">
Prominente Gastronomen </a></h3>
<span class="postTitle"></span>
<span class="district">Berlin</span> </div>
<div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
<div class="inlineImage alignLeft">
<div class="medium">
<img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
<span class="caption">
Schöne Lage: das Restaurant Ø. (c)QIEZ </span>
</div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
"Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
<div class="clear"></div>
</div>
I want every "I want this TEXT". I used this XPath query:
//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']
but it does not give me the text if the <p> tag is followed by any other tag.
It looks like you have div elements inside your p elements, which is not valid HTML and confuses the parser. If you var_dump inside the loop you can see that the query does pick up the node, but its nodeValue is empty.
A quick and dirty fix to your HTML would be to wrap the first div that is contained in the p element in a span.
<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>
A better fix would be to put the div element outside the paragraph.
If you use the dirty workaround, you will need to change your query like so:
$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
If you do not have control of the source HTML, you can work on a copy of it and remove the offending divs:
$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
It might be easier to work with simple_html_dom. Maybe you can try this:
include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);
foreach ($dom->find("div[class=bodyText]") as $parent) {
    foreach ($parent->children() as $child) {
        if ($child->tag == 'p' || $child->tag == 'h3') {
            // remove the inner text of divs contained within a p element
            foreach ($dom->find('div') as $e) {
                $e->innertext = '';
            }
            echo $child->plaintext . '<br>';
        }
    }
}
This is mixed content. Depending on what defines the position of the element, you can use a number of factors. In this case, simply selecting all the text nodes will probably be sufficient:
//div[contains(@class, 'bodyText')]/(p | h3)/text()
If the union operator within a path location is not allowed in your processor, then you can use your syntax as before or, a little bit simpler in my opinion:
//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
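Both parenthesized forms above are XPath 2.0 syntax, while PHP's DOMXPath is built on libxml2, which implements XPath 1.0. A minimal XPath 1.0 sketch with the self:: axis is shown below; the sample markup is a simplified, well-formed stand-in (an assumption), not the malformed HTML from the question.

```php
<?php
// Simplified, well-formed stand-in for the question's markup (assumed sample data).
$html = '<div class="bodyText">'
      . '<p>first</p>'
      . '<h3>second</h3>'
      . '<p><span>skipped</span>third</p>'
      . '<blockquote>ignored</blockquote>'
      . '</div>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// self::p or self::h3 is the XPath 1.0 way to say "p or h3" inside a single step;
// text() then selects only the text nodes that are direct children of those elements.
$query = "//div[contains(@class, 'bodyText')]/*[self::p or self::h3]/text()";

$texts = [];
foreach ($xpath->query($query) as $textNode) {
    $texts[] = trim($textNode->nodeValue);
}
print_r($texts);
```

Text nested deeper, such as the span content, is left out because text() selects direct children only.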

PHP preg_match_all + str_replace

I need to find a way to replace all the <p> within all the <blockquote> before the <hr />.
Here's a sample html:
<p>2012/01/03</p>
<blockquote>
<h4>File name</h4>
<p>Good Game</p>
</blockquote>
<blockquote><p>Laurie Ipsumam</p></blockquote>
<h4>Some title</h4>
<hr />
<p>Lorem Ipsum</p>
<blockquote><p>Laurel Ipsucandescent</p></blockquote>
Here's what I got:
$pieces = explode("<hr", $theHTML, 2);
$blocks = preg_match_all('/<blockquote>(.*?)<\/blockquote>/s', $pieces[0], $blockmatch);
if ($blocks) {
    $t1 = $blockmatch[1];
    for ($j = 0; $j < $blocks; $j++) {
        $paragraphs = preg_match_all('/<p>/', $t1[$j], $paragraphmatch);
        if ($paragraphs) {
            $t2 = $paragraphmatch[0];
            for ($k = 0; $k < $paragraphs; $k++) {
                // no backslashes needed: double quotes are literal inside single quotes
                $t1[$j] = str_replace($t2[$k], '<p class="whatever">', $t1[$j]);
            }
        }
    }
}
I think I'm really close, but I don't know how to put back together the html that I just pieced out and modified.
You could try SimpleXML or, better, DOMDocument (http://www.php.net/manual/en/class.domdocument.php), once you have made the markup valid HTML. Use it to find the nodes you are looking for and replace them; XPath (http://w3schools.com/xpath/xpath_syntax.asp) is a good fit for the lookup.
Edit 1:
Take a look at the answer of this question:
RegEx match open tags except XHTML self-contained tags
$string = explode('<hr', $string, 2); // split at the first <hr only, so nothing after a later <hr is lost
$string[0] = preg_replace('/<blockquote>(.*)<p>(.*)<\/p>(.*)<\/blockquote>/sU', '<blockquote>\1<p class="whatever">\2</p>\3</blockquote>', $string[0]);
$string = $string[0] . '<hr' . $string[1];
output:
<p>2012/01/03</p>
<blockquote>
<h4>File name</h4>
<p class="whatever">Good Game</p>
</blockquote>
<blockquote><p class="whatever">Laurie Ipsumam</p></blockquote>
<h4>Some title</h4>
<hr />
<p>Lorem Ipsum</p>
<blockquote><p>Laurel Ipsucandescent</p></blockquote>
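The DOMDocument/XPath suggestion above could be sketched as follows. This is one possible approach, not the code from either answer: it assumes "before the <hr />" means before the first hr in the document, and it reassembles the output from the children of the body element that libxml adds.

```php
<?php
$html = <<<'HTML'
<p>2012/01/03</p>
<blockquote>
<h4>File name</h4>
<p>Good Game</p>
</blockquote>
<blockquote><p>Laurie Ipsumam</p></blockquote>
<h4>Some title</h4>
<hr />
<p>Lorem Ipsum</p>
<blockquote><p>Laurel Ipsucandescent</p></blockquote>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// (//hr)[1] is the first hr in document order; preceding:: walks back from it,
// so only paragraphs inside blockquotes located before that hr are selected.
foreach ($xpath->query('(//hr)[1]/preceding::blockquote//p') as $p) {
    $p->setAttribute('class', 'whatever');
}

// Serialize the children of <body> to get the fragment back without the html/body wrapper.
$body = $dom->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result;
```

Unlike the regex version, this marks every p inside a matching blockquote, not just the first one per block.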
