How to get string from HTML with regex?


I'm trying to parse a block from an HTML page, so I tried to preg_match this block with PHP:
if( preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t))
but it doesn't work.
</div>
blablabla
blablabla
blablabla
<div class="adsdiv">
I want to grab only the blablabla blablabla words.
Any help?

Regex isn't the right tool for this. Here is how to do it with DOM:
$html = <<< HTML
<div class="parent">
<div>
<p>previous div</p>
</div>
blablabla
blablabla
blablabla
<div class="adsdiv">
<p>other content</p>
</div>
</div>
HTML;
Content in an HTML document lives in text nodes; tags are element nodes. Your text node containing blablabla has to have a parent node. To fetch its value, we will assume you want all the text nodes of the parent of the div whose class attribute is adsdiv:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$nodes = $xPath->query('//div[@class="adsdiv"]');
foreach($nodes as $node) {
foreach($node->parentNode->childNodes as $child) {
if($child instanceof DOMText) {
echo $child->nodeValue;
}
}
}
Yes, it's not a funky one liner, but it's also much less of a headache and gives you solid control over the HTML document. Harnessing the Query Power of XPath, we could have shortened the above to
$nodes = $xPath->query('//div[@class="adsdiv"]/../text()');
foreach($nodes as $node) {
echo $node->nodeValue;
}
I kept it deliberately verbose to illustrate how to use DOM though.
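If only the text sitting directly before the adsdiv is wanted, rather than every text node of the parent, a narrower XPath is one option. This is a minimal sketch assuming the same sample markup as above; the $text variable and the use of trim() are illustrative choices, not part of the answer.

```php
<?php
$html = <<<HTML
<div class="parent">
<div>
<p>previous div</p>
</div>
blablabla
blablabla
blablabla
<div class="adsdiv">
<p>other content</p>
</div>
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

// preceding-sibling::text()[1] selects the nearest text node before the div
$nodes = $xPath->query('//div[@class="adsdiv"]/preceding-sibling::text()[1]');
$text = $nodes->length ? trim($nodes->item(0)->nodeValue) : '';
echo $text;
```

The [1] predicate counts backwards from the div on this axis, so text that appears earlier in the parent is ignored.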

Apart from what has been said above, also add the /s modifier so . will match newlines. (edit: as Alan kindly pointed out, [^<]+ will match newlines anyway)
I always use /U as well since in these cases you normally want minimal matching by default. (will be faster as well). And /i since people say <div>, <DIV>, or even <Div>...
if (preg_match('/<\/div>([^<]+)<div class="adsdiv">/Usi', $data, $match))
{
echo "Found: ".$match[1]."<br>";
} else {
echo "Not found<br>";
}
edit made it a little more explicit!

From the PHP Manual:
s (PCRE_DOTALL) - If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
So, the following should work:
if (preg_match('~<\/div>(.*?)<div class="adsdiv">~s', $data, $t))
The ~ are there to delimit the regular expression.

You need to delimit your regex; use /<\/div>(.*?)<div class="adsdiv">/ instead.
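Putting the delimiter and the /s modifier together, a quick check could look like the sketch below. The contents of $data are an assumption that mirrors the snippet in the question, and $found is an illustrative name.

```php
<?php
// Sample input modelled on the question; the real $data would be the fetched page
$data = "</div>\nblablabla\nblablabla\nblablabla\n<div class=\"adsdiv\">";

// ~ as the delimiter avoids escaping the / in </div>; s lets . match newlines
$found = null;
if (preg_match('~</div>(.*?)<div class="adsdiv">~s', $data, $t)) {
    $found = trim($t[1]);
    echo $found;
}
```

Any non-alphanumeric, non-backslash, non-whitespace character can serve as the delimiter, so ~ or # is convenient when the pattern itself contains slashes.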

Related

Regex to find anchor tag not working accurately

I have the following regex to find anchor tag that has 'Kontakt' as the anchor text:
#<a.*href="[^"]*".*>Kontakt<\/a>#
Here is the string to find from:
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</li></ul>
So the result should be:
<a href="/kontakt" >Kontakt</a>
But the result I get is:
Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt
And here is my PHP code:
$pattern = '#<a.*href="[^"]*".*>Kontakt<\/a>#';
preg_match_all($pattern, $string, $matches);

You are using preg_match_all() so I assume you are willing to receive multiple qualifying anchor tags. Parsing valid HTML with a legitimate DOM parser will always be more stable and easier to read than the equivalent regex technique. It's just not a good idea to rely on regex for DOM parsing because regex is "DOM-unaware" -- it just matches things that look like HTML elements.
In the XPath query, search for <a> tags (existing at any depth in the document) which have the qualifying string as the whole text.
Code: (Demo)
$html = <<<HTML
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</a></li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</a></li></ul>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//a[text() = "Kontakt"]') as $a) {
$result[] = $dom->saveHtml($a);
}
var_export($result);
Output:
array (
  0 => '<a href="/kontakt">Kontakt</a>',
)
Is it more concise to use regex? Yes, but it is also less reliable for general use.
You will notice that the DOMDocument also automatically cleans up the unnecessary spacing in your markup.

If you can trust your input will always have <a href in every anchor tag then try:
'#<a href="[^"]*"[^>]*>Kontakt<\/a>#';
// Instead of what you have:
'#<a.*href="[^"]*".*>Kontakt<\/a>#';
.* is the "wildcard" meta-character . and the "zero or more times" quantifier * together.
.* matches anything any number of times.
Try it https://regex101.com/r/qxnRZv/1

Your regex:
...a.*href...
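is greedy, which means: "after a, match as many characters as possible before href". That is why your match spans several anchors.
You can use the lazy quantifier ? :
...a.*?href....
which means "after a, match as few characters as possible before href". Note that the match still begins at the first <a in the string, so replacing .* with a negated class such as [^>]* is the more reliable way to keep the match inside a single tag.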
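To see how the three variants behave, here is a small comparison. The $string below is a simplified, well-formed stand-in for the menu markup in the question (an assumption), and the variable names are illustrative.

```php
<?php
// Simplified stand-in for the menu markup from the question
$string = '<li><a href="/team" >Team</a></li><li><a href="/kontakt" >Kontakt</a></li>';

// Greedy: the match starts at the first <a and runs to Kontakt</a>
preg_match('#<a.*href="[^"]*".*>Kontakt</a>#', $string, $greedy);

// Lazy: still starts at the first <a, because the engine reports the leftmost match
preg_match('#<a.*?href="[^"]*".*?>Kontakt</a>#', $string, $lazy);

// Negated classes cannot cross a > character, so the match stays inside one tag
preg_match('#<a\s[^>]*href="[^"]*"[^>]*>Kontakt</a>#', $string, $tight);

echo $greedy[0], "\n", $lazy[0], "\n", $tight[0], "\n";
```

On this input only the last pattern isolates the Kontakt anchor; the greedy and lazy patterns both return the span that begins at the Team anchor.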

using preg_match with html comments

I want to convert into a string the html contained between these comments
<!--content-start-->
desired html
<!--content-end-->
so I use preg_match, right?
preg_match("/<!--content-start-->(.*)<!--content-end-->/i", $rss, $content);
but it won't work. Maybe a problem with the regex?
Thank you.

Perhaps a /s modifier will help. Check the documentation:
s (PCRE_DOTALL)
If this modifier is set, a dot metacharacter in the pattern matches all characters,
including newlines. Without it, newlines are excluded. This modifier is equivalent to
Perl's /s modifier. A negative class such as [^a] always matches a newline character,
independent of the setting of this modifier.
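A short sketch of the difference the modifier makes, assuming $rss contains the three lines from the question; the lazy .*? is an addition here so the capture stops at the first end marker.

```php
<?php
// Assumed input: the markers on separate lines, as in the question
$rss = "<!--content-start-->\ndesired html\n<!--content-end-->";

// Without s, the dot cannot cross the newlines, so nothing matches
$withoutS = preg_match('/<!--content-start-->(.*)<!--content-end-->/i', $rss, $m1);

// With s, the dot also matches newlines
$withS = preg_match('/<!--content-start-->(.*?)<!--content-end-->/is', $rss, $m2);

$content = $withS ? trim($m2[1]) : '';
echo $content;
```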

Something like this should work. The XPath query looks for a comment containing "content-start" and then returns the sibling nodes following it. We loop through until we find the closing comment.
$html = <<< HTML
<!--content-start-->
<p>Here is my <i>desired html</i></p>
<!-- a comment -->
<div class="foo">Here is more</div>
<!--content-end-->
<p>Not returning this</p>
HTML;
$return = "";
$dom = new DomDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xpath = new DomXpath($dom);
$siblings = $xpath->query("//comment()[.='content-start']/following-sibling::node()");
foreach ($siblings as $node) {
if ($node instanceof DOMComment && $node->textContent === "content-end") {
break;
}
$return .= $dom->saveHTML($node) . "\n";
}
echo $return;
Output:
<p>Here is my <i>desired html</i></p>
<!-- a comment -->
<div class="foo">Here is more</div>

replace all occurrences of a string

I want to add a class to all p tags that contain arabic text in it. For example:
<p>لمبارة وذ</p>
<p>do nothing</p>
<p>خمس دقائق يخ</p>
<p>مراعاة إبقاء 3 لاعبين</p>
should become
<p class="foo">لمبارة وذ</p>
<p>do nothing</p>
<p class="foo">خمس دقائق يخ</p>
<p class="foo">مراعاة إبقاء 3 لاعبين</p>
I am trying to use PHP preg_replace function to match the pattern (arabic) with following expression:
preg_replace("~(\p{Arabic})~u", "<p class=\"foo\">$1", $string, 1);
However it is not working properly. It has two problems:
It only matches the first paragraph.
Adds an empty <p>.
Sandbox Link

It only matches the first paragraph.
This is because you added the last argument, indicating you want only to replace the first occurrence. Leave that argument out.
Adds an empty <p>.
This is in fact the original <p> which you did not match. Just add it to the matching pattern, but keep it outside of the matching group, so it will be left out when you replace with $1.
Here is a corrected version, also on sandbox:
$text = preg_replace("~<p>(\p{Arabic}+)~u", "<p class=\"foo\">$1", $string);

Your first problem is that you weren't telling it to match the <p>, so it didn't.
Your main problem is that spaces aren't Arabic. Simply adding the alternative to match them fixes your problem:
$text = preg_replace("~<p>(\p{Arabic}*|\s*)~u", "<p class=\"foo\">$1", $string);
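Another way to handle spaces and digits inside the paragraph is to leave the content alone and only test for it. The sketch below is an alternative pattern, not the one from the answers above: a lookahead requires at least one Arabic letter before the next tag. It assumes the paragraphs contain plain text without nested tags.

```php
<?php
$string = "<p>لمبارة وذ</p>\n<p>do nothing</p>\n<p>مراعاة إبقاء 3 لاعبين</p>";

// (?=[^<]*\p{Arabic}) succeeds only if an Arabic letter occurs before the next <
$text = preg_replace('~<p>(?=[^<]*\p{Arabic})~u', '<p class="foo">', $string);
echo $text;
```

Because the lookahead consumes nothing, the paragraph text never needs to be captured and re-inserted.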

Using DOMDocument and DOMXPath:
$html = <<<'EOD'
<p>لمبارة وذ</p>
<p>خمس دقائق يخ</p>
<p>مراعاة إبقاء 3 لاعبين</p>
EOD;
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML('<div>'.$html.'</div>', LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
// here you register the php namespace and the preg_match function
// to be able to use it in the XPath query
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPhpFunctions('preg_match');
// select only p nodes with at least one arabic letter
$pNodes = $xpath->query("//p[php:functionString('preg_match', '~\p{Arabic}~u', .) > 0]");
foreach ($pNodes as $pNode) {
$pNode->setAttribute('class', 'foo');
}
$result = '';
foreach ($dom->documentElement->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
echo $result;

Remove <p><br/></p> with DOMxpath or regex?

I use DOMxpath to remove html tags that have empty text node but to keep <br/> tags,
$xpath = new DOMXPath($dom);
while(($nodeList = $xpath->query('//*[not(text()) and not(node()) and not(self::br)]')) && $nodeList->length > 0)
{
foreach ($nodeList as $node)
{
$node->parentNode->removeChild($node);
}
}
it works perfectly until I came across another problem,
$content = '<p><br/><br/><br/><br/></p>';
How do I remove this kind of messy <br/> and <p>? I don't want to allow a <p> that contains only <br/> tags, but I do allow <br/> together with proper text, like this:
$content = '<p>first break <br/> second break <br/> the last line</p>';
Is that possible?
Or is it better with a regular expression?
I tried something like this,
$nodeList = $xpath->query("//p[text()=<br\s*\/?>\s*]");
foreach($nodeList as $node)
{
$node->parentNode->removeChild($node);
}
but it return this error,
Warning: DOMXPath::query() [domxpath.query]: Invalid expression in...

You can select the unwanted p using XPath:
"//p[count(*)=count(br) and br and normalize-space(.)='']"
Note that to select elements with empty text, it might be better to use:
"//*[normalize-space(.)='' and not(self::br)]"
This will select any element (except br) without text content, including nodes like:
<p><b/><i/></p>
or
<p> <br/> <br/>
</p>
included.
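Applied to the two paragraphs from the question, the first expression can be used as follows. This is a sketch: the wrapping <div> and the variable names are assumptions added to make the snippet self-contained.

```php
<?php
// Both paragraphs from the question, wrapped in a div for easy serialisation
$content = '<div><p><br/><br/><br/><br/></p>'
         . '<p>first break <br/> second break <br/> the last line</p></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

$query = "//p[count(*)=count(br) and br and normalize-space(.)='']";
// Copy the node list to an array so removals do not disturb the iteration
foreach (iterator_to_array($xpath->query($query)) as $p) {
    $p->parentNode->removeChild($p);
}

$result = $dom->saveHTML($dom->getElementsByTagName('div')->item(0));
echo $result;
```

Only the paragraph made up solely of <br/> tags is removed; the one with text between the breaks is kept.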

I have almost the same situation. I use:
$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));
And use urldecode() to change it back for display or for inserting into the database.
It works for me.

You could get rid of them all by simply checking that the only things within a paragraph are whitespace and <br /> tags: preg_replace("/<p>(\s|<br\s*\/>)*<\/p>/", "", $content);
Broken down:
<p> # Match <p>
( # Beginning of a group
\s # Match a whitespace character
| # or...
<br\s*\/> # Match a <br /> tag, with any number (including 0) of spaces between <br and />
)* # Match this whole group (whitespace or <br /> tags) 0 or more times
<\/p> # Match </p>
I will mention, however, that unless your HTML is well-formatted (one-line, no strange spaces or paragraph classes, etc), you should not use regex to parse this. If it is, this regex should work just fine.

Regex / DOMDocument - match and replace text not in a link

I need to find and replace all text matches in a case insensitive way, unless the text is within an anchor tag - for example:
<p>Match this text and replace it</p>
<p>Don't <a href="/">match this text</a></p>
<p>We still need to match this text and replace it</p>
Searching for 'match this text' would only replace the first instance and last instance.
[Edit] As per Gordon's comment, it may be preferred to use DOMDocument in this instance. I'm not at all familiar with the DOMDocument extension, and would really appreciate some basic examples for this functionality.

Here is a UTF-8 safe solution, which works not only with properly formatted documents but also with document fragments.
The mb_convert_encoding call is needed because loadHtml() seems to have a bug with UTF-8 encoding (see here and here).
The mb_substr call trims the body tag from the output, so you get back your original content without any additional markup.
<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is a link <span>with <strong>don\'t match this text</strong> content</span></p>';
$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
$replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
$newNode = $dom->createDocumentFragment();
$newNode->appendXML($replaced);
$node->parentNode->replaceChild($newNode, $node);
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?
I read dozens of answers in the subject, so I am sorry if I forgot somebody (please comment it and I will add yours as well in this case).
Thanks for Gordon and stillstanding for commenting on my other answer.

Try this one:
$dom = new DOMDocument;
$dom->loadHTML($html_content);
function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
if (!empty($dom->childNodes)) {
foreach ($dom->childNodes as $node) {
if ($node instanceof DOMText &&
!in_array($node->parentNode->nodeName, $excludeParents))
{
$node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
}
else
{
preg_replace_dom($regex, $replacement, $node, $excludeParents);
}
}
}
}
preg_replace_dom('/match this text/i', 'IT WORKS', $dom->documentElement, array('a'));

This is the stackless non-recursive approach using pre-order traversal of the DOM tree.
libxml_use_internal_errors(TRUE);
$dom=new DOMDocument('1.0','UTF-8');
$dom->substituteEntities=FALSE;
$dom->recover=TRUE;
$dom->strictErrorChecking=FALSE;
$dom->loadHTMLFile($file);
$root=$dom->documentElement;
$node=$root;
$flag=FALSE;
for (;;) {
if (!$flag) {
if ($node->nodeType==XML_TEXT_NODE &&
$node->parentNode->tagName!='a') {
$node->nodeValue=preg_replace(
'/match this text/is',
$replacement, $node->nodeValue
);
}
if ($node->firstChild) {
$node=$node->firstChild;
continue;
}
}
if ($node->isSameNode($root)) break;
if ($flag=$node->nextSibling)
$node=$node->nextSibling;
else
$node=$node->parentNode;
}
echo $dom->saveHTML();
libxml_use_internal_errors(TRUE); and the 3 lines of code after $dom=new DOMDocument; should be able to handle any malformed HTML.

$a='<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace it</p>';
echo preg_replace('~match this text(?![^<]*</a>)~i','replacement',$a);
The negative lookahead ensures the replacement happens only if the next tag is not a closing link tag (</a>). It works fine with your example, though it won't work if you happen to use other tags inside your links.
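The limitation is easy to reproduce. In the sketch below, both sample strings and their href values are assumptions made up for illustration.

```php
<?php
$pattern = '~match this text(?![^<]*</a>)~i';

// Text directly inside the link: the next tag is </a>, so nothing is replaced
$plain = '<p>Don\'t <a href="/">match this text</a></p>';

// Text wrapped in <b> inside the link: the next tag is </b>, so the lookahead passes
$nested = '<p>Don\'t <a href="/"><b>match this text</b></a></p>';

$plainOut  = preg_replace($pattern, 'replacement', $plain);
$nestedOut = preg_replace($pattern, 'replacement', $nested);

echo $plainOut, "\n", $nestedOut, "\n";
```

The first string comes back unchanged, while the second has its link text replaced even though it sits inside an anchor.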

You can use PHP Simple HTML DOM Parser. It is similar to DOMDocument, but in my opinion it's simpler to use.
Here is the alternative in parallel with Netcoder's DomDocument solution:
function replaceWithSimpleHtmlDom($html_content, $search, $replace, $excludedParents = array()) {
require_once('simple_html_dom.php');
$html = str_get_html($html_content);
foreach ($html->find('text') as $element) {
if (!in_array($element->parent()->tag, $excludedParents))
$element->innertext = str_ireplace($search, $replace, $element->innertext);
}
return (string)$html;
}
I have just profiled this code against my DomDocument solution (which prints the exact same output), and the DomDocument is (not surprisingly) way faster (~4ms against ~77ms).

<?php
$a = '<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace it</p>
';
$res = preg_replace("#[^<a.*>]match this text#",'replacement',$a);
echo $res;
?>
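Note that [^<a.*>] is a character class: it consumes one character that is not <, a, ., * or >, so it does not really check whether the text is inside a link. The match is case sensitive, so only the lowercase phrase is replaced.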

HTML parsing with regexes is a huge challenge, and they can very easily end up getting too complex and taking up loads of memory. I would say the best way is to do this:
$html = preg_replace('/match this text/i', 'replacement text', $html);
$html = preg_replace('/(<a\b[^>]*>(?:(?!<\/a>).)*?)replacement text(.*?<\/a>)/is', '$1match this text$2', $html);
If your replacement text is something which might occur otherwise, you might want to add an intermediate step with some unique identifier.
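A different way to avoid the restore step altogether is to split the markup on whole anchor elements and replace only in the pieces between them. This is a swapped-in technique rather than the two-pass approach above, shown as a rough sketch; the sample markup and href are assumptions, and nested or malformed anchors are not handled.

```php
<?php
$html = '<p>Match this text and replace it</p>'
      . '<p>Don\'t <a href="/">match this text</a></p>'
      . '<p>We still need to match this text and replace it</p>';

// The capture group makes preg_split keep each <a>...</a> as an odd-indexed part
$parts = preg_split('~(<a\b[^>]*>.*?</a>)~is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $i => $part) {
    if ($i % 2 === 0) { // even index: markup outside any link
        $parts[$i] = str_ireplace('match this text', 'replacement', $part);
    }
}

$result = implode('', $parts);
echo $result;
```

Since the anchors are never touched, the replacement text can be anything, including a phrase that already occurs elsewhere in the document.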
