Remove <p><br/></p> with DOMxpath or regex? - php

I use DOMXPath to remove HTML elements that have no content, while keeping <br/> tags:
$xpath = new DOMXPath($dom);
while(($nodeList = $xpath->query('//*[not(text()) and not(node()) and not(self::br)]')) && $nodeList->length > 0)
{
foreach ($nodeList as $node)
{
$node->parentNode->removeChild($node);
}
}
It worked perfectly until I came across another problem:
$content = '<p><br/><br/><br/><br/></p>';
How do I remove this kind of messy <br/> and <p>? That is, I don't want to allow a <p> that contains nothing but <br/> tags, but I do allow <br/> mixed with proper text, like this:
$content = '<p>first break <br/> second break <br/> the last line</p>';
Is that possible?
Or is it better with a regular expression?
I tried something like this,
$nodeList = $xpath->query("//p[text()=<br\s*\/?>\s*]");
foreach($nodeList as $node)
{
$node->parentNode->removeChild($node);
}
but it returns this error:
Warning: DOMXPath::query() [domxpath.query]: Invalid expression in...

You can select the unwanted p using XPath:
"//p[count(*)=count(br) and br and normalize-space(.)='']"
Note that to select elements with no text, it might be better to use:
"//*[normalize-space(.)='' and not(self::br)]"
This selects any element (except br) whose text content is empty or whitespace only, including nodes like:
<p><b/><i/></p>
or
<p> <br/> <br/>
</p>
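As a minimal sketch of how the first query could be applied in PHP (the sample $content, the wrapping <div>, and the libxml flags are assumptions for illustration, not part of the question):

```php
<?php
// Hypothetical input: one paragraph holding only <br> tags, one with real text.
$content = '<div><p><br/><br/></p><p>first break <br/> the last line</p></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// <p> elements whose only element children are <br> and whose text is blank.
$query = "//p[count(*)=count(br) and br and normalize-space(.)='']";

// Copy the node list into an array before removing nodes from the tree.
foreach (iterator_to_array($xpath->query($query)) as $p) {
    $p->parentNode->removeChild($p);
}

$cleaned = $dom->saveHTML();
echo $cleaned;
```

Only the paragraph made up of <br> tags should be removed; the paragraph that mixes text and <br> is left alone.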

I had almost the same situation. I use:
$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));
and then urldecode() to change it back for display or before inserting into the database.
It works for me.

You could get rid of them all by checking that the only things within a paragraph are whitespace and <br /> tags (note that the pattern needs delimiters, here /): $content = preg_replace('/\<p\>(\s|\<br\s*\/\>)*\<\/p\>/', '', $content);
Broken down:
\<p\> # Match for <p>
( # Beginning of a group
\s # Match a space character
| # or...
\<br\s*\/\> # match a <br /> tag, with any number (including 0) spaces between the <br and />
)* # Match this whole group (spaces or <br /> tags) 0 or more times.
\<\/p\> # Match for </p>
I will mention, however, that unless your HTML is well-formatted (one-line, no strange spaces or paragraph classes, etc), you should not use regex to parse this. If it is, this regex should work just fine.
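A small runnable sketch of this idea, assuming bare <p> tags without attributes. The ~ delimiter, the optional slash (so that <br> is also accepted), and the sample string are choices made here for illustration:

```php
<?php
// Hypothetical input mixing <br/>, <br /> and <br> inside an otherwise empty paragraph.
$content = '<p><br/><br /> <br></p><p>first break <br/> the last line</p>';

// A paragraph containing nothing but whitespace and <br> variants.
$pattern = '~<p>(?:\s|<br\s*/?>)*</p>~i';

$cleaned = preg_replace($pattern, '', $content);
echo $cleaned;
```

Because the group may repeat zero times, this also strips completely empty <p></p> pairs.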

Related

Regex to find anchor tag not working accurately

I have the following regex to find anchor tag that has 'Kontakt' as the anchor text:
#<a.*href="[^"]*".*>Kontakt<\/a>#
Here is the string to find from:
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</li></ul>
So the result should be:
<a href="/kontakt" >Kontakt</a>
But the result I get is:
Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt
And here is my PHP code:
$pattern = '#<a.*href="[^"]*".*>Kontakt<\/a>#';
preg_match_all($pattern, $string, $matches);
You are using preg_match_all() so I assume you are willing to receive multiple qualifying anchor tags. Parsing valid HTML with a legitimate DOM parser will always be more stable and easier to read than the equivalent regex technique. It's just not a good idea to rely on regex for DOM parsing because regex is "DOM-unaware" -- it just matches things that look like HTML elements.
In the XPath query, search for <a> tags (existing at any depth in the document) which have the qualifying string as the whole text.
Code: (Demo)
$html = <<<HTML
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</li></ul>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//a[text() = "Kontakt"]') as $a) {
$result[] = $dom->saveHtml($a);
}
var_export($result);
Output:
array (
0 => '<a href="/kontakt">Kontakt</a>',
)
Is it more concise to use regex? Yes, but it is also less reliable for general use.
You will notice that the DOMDocument also automatically cleans up the unnecessary spacing in your markup.
If you can trust your input will always have <a href in every anchor tag then try:
'#<a href="[^"]*"[^>]*>Kontakt<\/a>#';
// Instead of what you have:
'#<a.*href="[^"]*".*>Kontakt<\/a>#';
.* is the "wildcard" meta-character . and the "zero or more times" quantifier * together.
.* matches anything any number of times.
Try it https://regex101.com/r/qxnRZv/1
Your regex:
...a.*href...
is greedy, which means: "after a, match as many characters as possible before a href". That causes your regex to return multiple hrefs.
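You can use the lazy quantifier ? :
...a.*?href....
which means "after a, match as few characters as possible before a href". On its own this may not be enough, though: the match still starts at the first <a in the string, and the second .* can still run across tag boundaries, so a negated class such as [^>]* is the more dependable choice.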
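To compare the three styles side by side, here is a small sketch. The $string below is a hypothetical, well-formed version of the menu markup with the closing </a> tags in place, and the variable names are made up for the example:

```php
<?php
// Hypothetical well-formed menu markup.
$string = '<li><a href="/team" >Team</a></li><li><a href="/kontakt" >Kontakt</a></li>';

$greedy  = '#<a.*href="[^"]*".*>Kontakt</a>#';          // pattern from the question
$lazy    = '#<a.*?href="[^"]*".*?>Kontakt</a>#';        // lazy quantifiers
$negated = '#<a\s[^>]*href="[^"]*"[^>]*>Kontakt</a>#';  // cannot cross a ">"

preg_match($greedy, $string, $g);
preg_match($lazy, $string, $l);
preg_match($negated, $string, $n);

// Both the greedy and the lazy pattern start at the first <a and run
// through to Kontakt</a>; only the negated class isolates the wanted anchor.
var_export([$g[0], $l[0], $n[0]]);
```

The regex engine always reports the leftmost match, which is why switching to lazy quantifiers does not move the starting point to the second anchor.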

replace all occurrences of a string

I want to add a class to all p tags that contain Arabic text. For example:
<p>لمبارة وذ</p>
<p>do nothing</p>
<p>خمس دقائق يخ</p>
<p>مراعاة إبقاء 3 لاعبين</p>
should become
<p class="foo">لمبارة وذ</p>
<p>do nothing</p>
<p class="foo">خمس دقائق يخ</p>
<p class="foo">مراعاة إبقاء 3 لاعبين</p>
I am trying to use PHP's preg_replace function to match the pattern (Arabic) with the following expression:
preg_replace("~(\p{Arabic})~u", "<p class=\"foo\">$1", $string, 1);
However it is not working properly. It has two problems:
It only matches the first paragraph.
Adds an empty <p>.
Sandbox Link
It only matches the first paragraph.
This is because you added the last argument, indicating you want only to replace the first occurrence. Leave that argument out.
Adds an empty <p>.
This is in fact the original <p> which you did not match. Just add it to the matching pattern, but keep it outside of the matching group, so it will be left out when you replace with $1.
Here is a corrected version, also on sandbox:
$text = preg_replace("~<p>(\p{Arabic}+)~u", "<p class=\"foo\">$1", $string);
Your first problem is that you weren't telling it to match the <p>, so the original tag was left in place.
The other thing to keep in mind is that spaces aren't Arabic. Allow optional whitespace before the first Arabic character, but still require at least one Arabic character so that non-Arabic paragraphs are not matched:
$text = preg_replace("~<p>(\s*\p{Arabic}+)~u", "<p class=\"foo\">$1", $string);
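A variation that does not depend on the paragraph starting with Arabic is to use a lookahead. This is only a sketch, and it assumes plain <p> tags with no attributes and no nested tags before the Arabic text:

```php
<?php
// Sample paragraphs taken from the question.
$string = "<p>لمبارة وذ</p>\n<p>do nothing</p>\n<p>خمس دقائق يخ</p>\n<p>مراعاة إبقاء 3 لاعبين</p>";

// Match "<p>" only when an Arabic character appears before the next "<".
// The lookahead consumes nothing, so the paragraph text is left untouched.
$text = preg_replace('~<p>(?=[^<]*\p{Arabic})~u', '<p class="foo">', $string);

echo $text;
```

Since only the opening tag is replaced, there is no need for a capture group or $1 in the replacement.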
Using DOMDocument and DOMXPath:
$html = <<<'EOD'
<p>لمبارة وذ</p>
<p>خمس دقائق يخ</p>
<p>مراعاة إبقاء 3 لاعبين</p>
EOD;
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML('<div>'.$html.'</div>', LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
// here you register the php namespace and the preg_match function
// to be able to use it in the XPath query
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPhpFunctions('preg_match');
// select only p nodes with at least one arabic letter
$pNodes = $xpath->query("//p[php:functionString('preg_match', '~\p{Arabic}~u', .) > 0]");
foreach ($pNodes as $pNode) {
$pNode->setAttribute('class', 'foo');
}
$result = '';
foreach ($dom->documentElement->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
echo $result;

preg_match all paragraphs in a string

The following string contains multiple <p> tags. I want to match the contents of each of the <p> with a pattern, and if it matches, I want to add a css class to that specific paragraph.
For example, in the following string only the second paragraph's content matches, so I want to add a class to that paragraph only.
$string = '<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>';
With the following code, I can match all of the string, but I am unable to figure out how to find the specific paragraph.
$rtl_chars_pattern = '/[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]/u';
$return = preg_match($rtl_chars_pattern, $string);
Create a capture group on the first matching character after the <p> tag
Use preg_replace and put the captured character back with $1, otherwise it is dropped from the output
https://regex101.com/r/nE5pT1/1
$str = "<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>";
$result = preg_replace("/<p>([\\x{0590}-\\x{05ff}\\x{0600}-\\x{06ff}])/u", "<p class=\"foo\">$1", $str);
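If the matching characters can appear anywhere inside the paragraph, one option is to test each paragraph separately with preg_replace_callback. A minimal sketch, assuming simple <p>...</p> pairs without attributes or nesting (the class name foo is just a placeholder):

```php
<?php
$string = '<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>';
$rtl_chars_pattern = '/[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]/u';

// Visit every paragraph; rewrite only those whose content matches the RTL pattern.
$result = preg_replace_callback(
    '~<p>(.*?)</p>~su',
    function (array $m) use ($rtl_chars_pattern) {
        return preg_match($rtl_chars_pattern, $m[1]) === 1
            ? '<p class="foo">' . $m[1] . '</p>'
            : $m[0]; // leave non-matching paragraphs unchanged
    },
    $string
);

echo $result;
```

This reuses the pattern from the question unchanged and keeps the per-paragraph decision in one place.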
Use a combination of SimpleXML, XPath and regular expressions (regex on text(), etc. are only supported as of XPath 2.0, which PHP's libxml-based extensions do not implement).
The steps:
Load the DOM first
Get all p tags via an xpath query
If the text / node value matches your regex, apply a css class
This is the actual code:
<?php
$html = "<html><p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p></html>";
$xml = simplexml_load_string($html);
# query the dom for all p tags
$ptags = $xml->xpath("//p");
# your regex
$regex = '~[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]~u';
# alternatively:
# $regex = '~\p{Arabic}~u';
# loop over the tags, if the regex matches, add another attribute
foreach ($ptags as &$p) {
if (preg_match($regex, (string) $p))
$p->addAttribute('class', 'some cool css class');
}
# just to be sure the tags have been altered
echo $xml->asXML();
?>
See a demo on ideone.com. The code has the advantage that you only analyze the content of the p tag, not the DOM structure in general.

How to get string from HTML with regex?

I'm trying to parse a block from an HTML page, so I tried to preg_match this block with PHP:
if( preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t))
but it doesn't work. The input looks like this:
</div>
blablabla
blablabla
blablabla
<div class="adsdiv">
I want to grab only the blablabla blablabla words.
Any help?
Regex isn't the right tool for this. Here is how to do it with DOM:
$html = <<< HTML
<div class="parent">
<div>
<p>previous div</p>
</div>
blablabla
blablabla
blablabla
<div class="adsdiv">
<p>other content</p>
</div>
</div>
HTML;
Content in an HTML document is held in text nodes; tags are element nodes. The text node containing blablabla has to have a parent node. To fetch its value, we will assume you want all the text nodes of the parent of the div whose class attribute is adsdiv:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
// @class selects the class attribute
$nodes = $xPath->query('//div[@class="adsdiv"]');
foreach ($nodes as $node) {
    foreach ($node->parentNode->childNodes as $child) {
        if ($child instanceof DOMText) {
            echo $child->nodeValue;
        }
    }
}
Yes, it's not a funky one liner, but it's also much less of a headache and gives you solid control over the HTML document. Harnessing the Query Power of XPath, we could have shortened the above to
$nodes = $xPath->query('//div[@class="adsdiv"]/../text()');
foreach($nodes as $node) {
echo $node->nodeValue;
}
I kept it deliberately verbose to illustrate how to use DOM though.
Apart from what has been said above, also add the /s modifier so . will match newlines. (edit: as Alan kindly pointed out, [^<]+ will match newlines anyway)
I always use /U as well since in these cases you normally want minimal matching by default. (will be faster as well). And /i since people say <div>, <DIV>, or even <Div>...
if (preg_match('/<\/div>([^<]+)<div class="adsdiv">/Usi', $data, $match))
{
echo "Found: ".$match[1]."<br>";
} else {
echo "Not found<br>";
}
edit made it a little more explicit!
From the PHP Manual:
s (PCRE_DOTALL) - If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
So, the following should work:
if (preg_match('~<\/div>(.*?)<div class="adsdiv">~s', $data, $t))
The ~ are there to delimit the regular expression.
You need to delimit your regex; use /<\/div>(.*?)<div class="adsdiv">/ instead.
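Putting the delimiter fix together, here is a minimal sketch. The $data string is a hypothetical reconstruction of the fragment in the question. As far as I can tell, the original call fails because PHP also accepts bracket pairs such as < and > as delimiters, so it reads <\/div> as the entire pattern and tries to treat the rest as modifiers.

```php
<?php
// Hypothetical input modelled on the fragment in the question.
$data = "</div>\nblablabla\nblablabla\nblablabla\n<div class=\"adsdiv\">";

// ~ delimits the pattern; the s modifier lets . match the newlines
// between the closing </div> and the adsdiv block.
if (preg_match('~</div>(.*?)<div class="adsdiv">~s', $data, $t)) {
    $text = trim($t[1]);
    echo $text;
} else {
    $text = null;
    echo "Not found";
}
```

Using ~ rather than / as the delimiter means the slash in </div> does not need escaping.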

How to grab the contents of HTML tags?

Hey so what I want to do is snag the content for the first paragraph. The string $blog_post contains a lot of paragraphs in the following format:
<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>
The problem I'm running into is that I am writing a regex to grab everything between the first <p> tag and the first closing </p> tag. However, it is grabbing the first <p> tag and the last closing </p> tag which results in me grabbing everything.
Here is my current code:
if (preg_match("/[\\s]*<p>[\\s]*(?<firstparagraph>[\\s\\S]+)[\\s]*<\\/p>[\\s\\S]*/",$blog_post,$blog_paragraph))
echo "<p>" . $blog_paragraph["firstparagraph"] . "</p>";
else
echo $blog_post;
Well, sysrqb will let you match anything in the first paragraph assuming there's no other html in the paragraph. You might want something more like this
<p>.*?</p>
Placing the ? after your * makes it non-greedy, meaning it will only match as little text as necessary before matching the </p>.
If you use preg_match, use the "U" flag to make it un-greedy.
preg_match("/<p>(.*)<\/p>/U", $blog_post, $matches);
$matches[1] will then contain the first paragraph.
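For reference, a small sketch of the non-greedy version using .*? instead of the U flag. The sample $blog_post is the one from the question; the s modifier and the ~ delimiter are additions made here:

```php
<?php
$blog_post = '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>';

// .*? stops at the first </p>; the s modifier allows paragraphs that span lines.
if (preg_match('~<p>(.*?)</p>~s', $blog_post, $matches)) {
    $first = $matches[1];
} else {
    $first = null; // no paragraph found
}

var_export($first);
```

Writing the laziness into the pattern keeps it visible at the point where it matters, whereas the U flag inverts every quantifier in the pattern.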
It would probably be easier and faster to use strpos() to find the position of the first <p> and first </p>, then use substr() to extract the paragraph.
$paragraph_start = strpos($blog_post, '<p>');
$paragraph_end = strpos($blog_post, '</p>', $paragraph_start);
$paragraph = substr($blog_post, $paragraph_start + strlen('<p>'), $paragraph_end - $paragraph_start - strlen('<p>'));
Edit: Actually the regex in others' answers will be easier and faster... your big complex regex in the question confused me...
Using Regular Expressions for html parsing is never the right solution. You should be using XPATH for this particular case:
$string = <<<XML
<div><p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p></div>
XML;
$xml = new SimpleXMLElement($string);
/* Find the first <p>; the markup must be well-formed XML with a single root element */
$result = $xml->xpath('//p[1]');
