How to match character inside inline style? - php

I want to match one letter or number or symbol inside inline style.
Example:
<html>
<head>
</head>
<body>
<p style="color: #48ad64;font-weight:10px;">hi there</p>
<div style="background-color: #48ad64;">
<h3>perfect</h3>
</div>
</body>
</html>
I want to match any c or o or # or 4 or ; or -
If we take o for example, it's supposed to match 5 occurrences.
I want to replace every occurrence within a style declaration using preg_replace().
How can I get this? I tried so many different expressions, but none of them did what I want.
Some of what I tried:
/(?:\G(?!^)|\bstyle=")(?:.{0,}?)(o)(?=[^>]*>)/
/(style=")(?:\w+)(o)(([^"]*)")/
I just need the regex to match all o in my HTML. I expect this:
<html>
<head>
</head>
<body>
<p style="c'o'lor: #48ad64;f'o'nt-weight:10px;">how blabla</p>
<div style="backgr'o'und-c'o'l'o'r: #48ad64;">
<h3>perfect normal o moral bla bal</h3>
</div>
</body>
</html>
I just want all o occurrences inside inline-style above to be replaced with 'o'

A quick/dirty/simple solution is to use preg_replace_callback() with str_replace().
Pattern: (Demo with Pattern Explanation) /<[^<]+ style="\K.*?(?=">)/
Code: (Demo)
$html='<html>
<head>
</head>
<body>
<p style="color: #48ad64;font-weight:10px;">hi there</p>
<div style="background-color: #48ad64;">
<h3>perfect</h3>
</div>
</body>
</html>';
$needle="o";
echo preg_replace_callback('/<[^<]+ style="\K.*?(?=">)/',function($m)use($needle){return str_replace($needle,"<b>$needle</b>",$m[0]);},$html);
// for case-insensitive matching, add the i flag after the pattern's closing delimiter; for case-insensitive replacing, use str_ireplace() instead of str_replace()
Output:
<html>
<head>
</head>
<body>
<p style="c<b>o</b>l<b>o</b>r: #48ad64;f<b>o</b>nt-weight:10px;">hi there</p>
<div style="backgr<b>o</b>und-c<b>o</b>l<b>o</b>r: #48ad64;">
<h3>perfect</h3>
</div>
</body>
</html>
This is a pure regex replacement method/pattern:
$needle="o";
// \Q...\E makes the needle value literal
echo preg_replace('/(?:\G(?!^)|\bstyle=")[^"]*?\K\Q'.$needle.'\E/',"'$needle'",$html);
// [^"]*? assumes no escaped " in the style value; \K restarts the fullstring match
The [^"]*? component eliminates the need for a lookahead. However, if a font family name (or similar) were to use \" (escaped double quotes) then replacement accuracy would be negatively impacted.
I wouldn't call either of these methods "robust" because certain substrings of text may trick the pattern into "over-matching" illegitimate style substrings.
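To make the over-matching risk concrete, here is a minimal sketch. The input string is a hypothetical example (not taken from the question) in which style="..." appears only as visible text, and preg_quote() stands in for \Q...\E. The regex still rewrites the text, while an XPath check finds no element that actually carries a style attribute.

```php
<?php
// Hypothetical input: style="..." is plain text here, not an attribute.
$html = '<p>Write style="color: red" inside the tag.</p>';
$needle = 'o';

// The pure regex replacement, with preg_quote() making the needle literal.
$pattern = '/(?:\G(?!^)|\bstyle=")[^"]*?\K' . preg_quote($needle, '/') . '/';
$regexResult = preg_replace($pattern, "'$needle'", $html);

// DOM-based check: count elements that really have a style attribute.
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$styledElements = $xpath->query('//*[@style]')->length;

echo $regexResult, PHP_EOL;     // the text content has been altered
echo $styledElements, PHP_EOL;  // no real style attributes exist
```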
To do this properly, I suggest that you use DOMDocument or some other HTML parser to ensure you are only modifying real/true style attributes.
DomDocument Code: (Demo)
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); // 2nd params to remove DOCTYPE
$xp = new DOMXpath($dom);
foreach ($xp->query('//*[@style]') as $node) {
$node->setAttribute('style',str_replace($needle,"'$needle'",$node->getAttribute('style'))); // no regex
}
echo $dom->saveHTML();

Related

Extract tag attributes from HTML with regex

I want to read all tag attributes with the word title, HTML sample below
<html>
<head>
<title> </title>
</head>
<body>
<div title="abc"> </div>
<div>
<span title="abcd"> </span>
</div>
<input type="text" title="abcde">
</body>
</html>
I have tried this regex function, which doesn't work
preg_match('\btitle="\S*?"\b', $html, $matches);
Just to follow up on my comment: regular expressions aren't particularly safe or robust enough to manage HTML (although with some HTML there is little hope of anything working fully). Have a read of https://stackoverflow.com/a/1732454/1213708.
DOMDocument provides a more reliable method. To do the processing you are after, you can use XPath and search for any title attributes using //@title (the @ sign is the XPath notation for an attribute).
$html = '<html>
<head>
<title> </title>
</head>
<body>
<div title="abc"> </div>
<div>
<span title="abcd"> </span>
</div>
<input type="text" title="abcde">
</body>
</html>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach($xpath->query('//@title') as $link) {
echo $link->textContent.PHP_EOL;
}
which outputs...
abc
abcd
abcde
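If you also need to know which element each title belongs to, the attribute node can be traced back to its element. The following is a minimal sketch, assuming a shortened version of the sample HTML; it relies on the ownerElement and value properties of DOMAttr.

```php
<?php
// Shortened sample markup (an assumption made for brevity).
$html = '<div title="abc"> </div>'
      . '<div><span title="abcd"> </span></div>'
      . '<input type="text" title="abcde">';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$pairs = [];
foreach ($xpath->query('//@title') as $attr) {
    // ownerElement is the element that carries the attribute.
    $pairs[] = $attr->ownerElement->nodeName . ': ' . $attr->value;
}
echo implode(PHP_EOL, $pairs), PHP_EOL;
```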
Here's a regex solution
preg_match_all('~\s+title\s*=\s*["\'](?P<title>[^"]*?)["\']~', $html, $matches);
$matches = array_pop($matches);
foreach($matches as $m){
echo $m . " ";
}

How can I select only the immediate parent node of a text string using xpath for every match

Note: this differs from the following question in that here we have values appearing within a node and within a childnode of that same node:
XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode
Given the following html:
$content =
'<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a>interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>';
And the following xpath:
//*[contains(text(),'interim')]
... only provides 3 matches, whereas I want four matches. As per comments, the four elements I'm expecting are P P A LI.
This works exactly as expected. See this glot.io link.
<?php
$html = <<<HTML
<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a>interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//*/text()[contains(.,"interim")]') as $n) var_dump($n->getNodePath());
You will get four matches:
/html/body/div[1]/p/text()
/html/body/div[2]/p/a/text()
/html/body/div[2]/p/text()[2]
/html/body/div[3]/ul/li/text()
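Since the question asks for the immediate parent element rather than the text node itself, one option is to read parentNode on each matching text node. This is a minimal sketch, assuming the second paragraph wraps its first "interim" in an <a> element (which is what the P P A LI expectation and the node paths above imply).

```php
<?php
$html = <<<HTML
<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a>interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$parents = [];
foreach ($xpath->query('//text()[contains(., "interim")]') as $text) {
    // The immediate parent of a matching text node is the element we want.
    $parents[] = $text->parentNode->nodeName;
}
echo implode(' ', $parents), PHP_EOL;
```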

Select p tag after h2 that has a child with id

How can I select a p-tag that is after a tag that has a specific child? Using a web crawler.
http://symfony.com/doc/current/components/css_selector.html
$crawler->filter('h2 span#hello + p')->each(function ($node) {
var_dump($node->html());
});
Example:
<h2><span id="hello">Hi</span></h2>
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
<a>Not me!</a>
<h2>lol</h2>
<p>yo, not me</p>
The selector above does not work on this HTML.
It is usually best to traverse HTML using the DOMDocument class (http://php.net/manual/en/class.domdocument.php) but you could do it with a regular expression thus:
// put the example HTML code into a string
$html = <<< EOF
<h2><span id="hello">Hi</span></h2>
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
<a>Not me!</a>
<h2>lol</h2>
<p>yo, not me</p>
EOF;
// set up a regular expression
$re = "/<h2[^>]*>.*?<span[^>]*id=\"hello\"[^>]*>.*?<\\/h2[^>]*>.*?(<p.*?)<[^\\/p]/sim";
// get the match ... the (<p.*?) group in the above regex
preg_match($re,$html,$matches);
print $matches[1];
Would output:
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
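For comparison, the DOMDocument route mentioned above could look like the following sketch. The XPath expression is my own suggestion, not part of the original answer: it selects every p sibling whose nearest preceding h2 contains the span with id "hello", so the p after the second h2 is skipped.

```php
<?php
$html = <<<EOF
<h2><span id="hello">Hi</span></h2>
<p>I want this p-tag, that is after the h2 above</p>
<p>me too!</p>
<a>Not me!</a>
<h2>lol</h2>
<p>yo, not me</p>
EOF;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// p siblings after the matching h2, kept only if that h2 is their nearest preceding h2.
$query = '//h2[span[@id="hello"]]/following-sibling::p'
       . '[preceding-sibling::h2[1][span[@id="hello"]]]';

$texts = [];
foreach ($xpath->query($query) as $p) {
    $texts[] = $p->textContent;
}
print_r($texts);
```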

Regular expression missed first occurrence of target string

I am using regular expression to fetch both text1 and text2 in the following html code. Here is what I am using:
/<div\s?class="right-col">[\s\n\S]*<a[\s\n]?[^>]*>#(.*)<\/a>/
but apparently I missed text1, only got text2(here is the link to my problem).
<div class="right-col">
<h1>
title1
</h1>
<p>some text here</p>
<div class="some-class">
<div class="left">
<span>some text here </span>
</div>
<div class="postmeta"><a href="url-link-here" >#text1</a> </div>
</div>
<div class="right-col">
<h1>
title2
</h1>
<p>some text here</p>
<div class="some-class">
<div class="left">
<span>some text here </span>
</div>
<div class="postmeta"><a href="url-link-here" >#text2</a> </div>
</div>
Can you guys tell me what went wrong in my regular expression? Is there a better way to capture both title1, title2 and text1, text2?
Using a regular expression here is not the best way to do it. It's bad practice. You should be using a DOM/XML parser to do this.
I like using PHP's DOMDocument class. Using XPath, we can quickly find the elements you want
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$aTags = $xPath->query('//div[@class="some-class"]//a[starts-with(text(), "#")]');
foreach($aTags as $a){
echo $a->nodeValue;
}
DEMO: http://codepad.viper-7.com/QHOXzH
This is a fairly common issue with regular expressions: quantifiers are greedy by default. [\s\S]* (the \n is not needed, since \s already covers it) consumes as much as it can, so it runs past the first <a and backtracks only as far as the last one, which is why only text2 is captured. Adding a ? makes the quantifier lazy, and with that change your linked example returns both text1 and text2.
The short answer is to replace [\s\n\S]* with [\s\S]*?, but as others have mentioned, this is probably not a good use of regular expressions.
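The difference can be checked with a small sketch. The HTML below is a condensed, hypothetical version of the sample (one right-col block per line), and preg_match_all() is used so that every match is collected.

```php
<?php
// Condensed sample: one "right-col" block per line (an assumption for brevity).
$html = '<div class="right-col"><h1>title1</h1><div class="postmeta"><a href="url-link-here" >#text1</a> </div></div>' . "\n"
      . '<div class="right-col"><h1>title2</h1><div class="postmeta"><a href="url-link-here" >#text2</a> </div></div>';

// Original pattern: greedy [\s\S]* skips ahead to the last <a ...>.
$greedyPattern = '/<div\s?class="right-col">[\s\S]*<a[\s\n]?[^>]*>#(.*)<\/a>/';
// Suggested fix: lazy [\s\S]*? stops at the first <a ...> after each div.
$lazyPattern = '/<div\s?class="right-col">[\s\S]*?<a[\s\n]?[^>]*>#(.*)<\/a>/';

preg_match_all($greedyPattern, $html, $greedy);
preg_match_all($lazyPattern, $html, $lazy);

print_r($greedy[1]); // captures from the greedy pattern
print_r($lazy[1]);   // captures from the lazy pattern
```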

PHP Regex - Remove text from HTML Tags

How to remove all text between tags.
Input
<div>
<p>testing</p>
<div>my world</div>
</div>
Output
<div>
<p></p>
<div></div>
</div>
You can use either DOMDocument or PHP Simple HTML DOM Parser.
The following example uses the latter, although you may want to use what suits you best.
include("simple_html_dom.php");
$str = '
<div>
<p>testing</p>
<div>my world</div>
</div>
';
$html = str_get_html($str);
foreach($html->find("text") as $ht) {
$ht->innertext = "";
}
$html->save();
echo $html;
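If you would rather avoid the extra library, a DOMDocument version might look like the sketch below. It is an assumption on my part that whitespace-only text nodes should stay in place so the line breaks between tags survive; only text nodes with visible content are removed.

```php
<?php
$str = '<div>
<p>testing</p>
<div>my world</div>
</div>';

$dom = new DOMDocument();
$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Copy the matches first so removing nodes cannot disturb the iteration.
// normalize-space() filters out whitespace-only text nodes.
$textNodes = iterator_to_array($xpath->query('//text()[normalize-space()]'));
foreach ($textNodes as $node) {
    $node->parentNode->removeChild($node);
}

$out = $dom->saveHTML();
echo $out;
```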
You could use two capture groups which would eliminate characters between them while replacing:
(\<.+\>).*(\<\/.+\>)
working example: http://ideone.com/Oq14El
