I want to add a class to all p tags that contain Arabic text. For example:
<p>لمبارة وذ</p>
<p>do nothing</p>
<p>خمس دقائق يخ</p>
<p>مراعاة إبقاء 3 لاعبين</p>
should become
<p class="foo">لمبارة وذ</p>
<p>do nothing</p>
<p class="foo">خمس دقائق يخ</p>
<p class="foo">مراعاة إبقاء 3 لاعبين</p>
I am trying to use PHP's preg_replace function to match Arabic text with the following expression:
preg_replace("~(\p{Arabic})~u", "<p class=\"foo\">$1", $string, 1);
However it is not working properly. It has two problems:
It only matches the first paragraph.
Adds an empty <p>.
It only matches the first paragraph.
This is because you added the last argument, indicating you want only to replace the first occurrence. Leave that argument out.
Adds an empty <p>.
This is in fact the original <p> which you did not match. Just add it to the matching pattern, but keep it outside of the matching group, so it will be left out when you replace with $1.
Here is a corrected version, also on sandbox:
$text = preg_replace("~<p>(\p{Arabic}+)~u", "<p class=\"foo\">$1", $string);
Your first problem is that you weren't telling it to match the <p>, so it didn't.
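Your main problem is that whitespace isn't Arabic, so a paragraph that starts with whitespace never matches. Allow optional whitespace before the first Arabic character. (Don't make the whole group optional, as in \p{Arabic}*|\s*: that also matches the empty string, so every <p> would get the class.)
$text = preg_replace("~<p>(\s*\p{Arabic})~u", "<p class=\"foo\">$1", $string);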
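The regex answers above only tag a paragraph when the Arabic text comes right after the opening tag. As a hedged alternative, here is a minimal sketch that uses a lookahead to accept an Arabic character anywhere before the next "<". It assumes the paragraphs are plain <p> tags without attributes and contain no nested tags before the Arabic text; the sample string is the one from the question.

```php
<?php
// Sample input taken from the question above.
$string = "<p>لمبارة وذ</p>\n<p>do nothing</p>\n<p>خمس دقائق يخ</p>\n<p>مراعاة إبقاء 3 لاعبين</p>";

// Match "<p>" only when an Arabic character occurs before the next "<".
// The lookahead (?=...) consumes nothing, so the paragraph text is left untouched.
$text = preg_replace('~<p>(?=[^<]*\p{Arabic})~u', '<p class="foo">', $string);

echo $text, "\n";
```

Because nothing but the tag itself is consumed, no capture group or $1 is needed in the replacement.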
Using DOMDocument and DOMXPath:
$html = <<<'EOD'
<p>لمبارة وذ</p>
<p>خمس دقائق يخ</p>
<p>مراعاة إبقاء 3 لاعبين</p>
EOD;
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML('<div>'.$html.'</div>', LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
// here you register the php namespace and the preg_match function
// to be able to use it in the XPath query
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPhpFunctions('preg_match');
// select only p nodes with at least one arabic letter
$pNodes = $xpath->query("//p[php:functionString('preg_match', '~\p{Arabic}~u', .) > 0]");
foreach ($pNodes as $pNode) {
    $pNode->setAttribute('class', 'foo');
}
$result = '';
foreach ($dom->documentElement->childNodes as $childNode) {
    $result .= $dom->saveHTML($childNode);
}
echo $result;
I have the following regex to find anchor tag that has 'Kontakt' as the anchor text:
#<a.*href="[^"]*".*>Kontakt<\/a>#
Here is the string to find from:
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</li></ul>
So the result should be:
<a href="/kontakt" >Kontakt</a>
But the result I get is:
Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt
And here is my PHP code:
$pattern = '#<a.*href="[^"]*".*>Kontakt<\/a>#';
preg_match_all($pattern, $string, $matches);
You are using preg_match_all(), so I assume you are willing to receive multiple qualifying anchor tags. Parsing valid HTML with a legitimate DOM parser will always be more stable and easier to read than the equivalent regex technique. It's just not a good idea to rely on regex for DOM parsing because regex is "DOM-unaware": it just matches character sequences that look like HTML elements.
In the XPath query, search for <a> tags (existing at any depth in the document) which have the qualifying string as the whole text.
Code:
$html = <<<HTML
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</li></ul>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//a[text() = "Kontakt"]') as $a) {
    $result[] = $dom->saveHTML($a);
}
var_export($result);
Output:
array (
0 => '<a href="/kontakt">Kontakt</a>',
)
Is it more concise to use regex? Yes, but it is also less reliable for general use.
You will notice that the DOMDocument also automatically cleans up the unnecessary spacing in your markup.
If you can trust your input will always have <a href in every anchor tag then try:
'#<a href="[^"]*"[^>]*>Kontakt<\/a>#';
// Instead of what you have:
'#<a.*href="[^"]*".*>Kontakt<\/a>#';
.* is the "wildcard" meta-character . and the "zero or more times" quantifier * together.
.* matches anything any number of times.
Try it https://regex101.com/r/qxnRZv/1
Your regex:
...a.*href...
is greedy, which means: "after a, match as many characters as possible before a href". That causes your regex to return multiple hrefs.
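You can use the lazy quantifier modifier ?:
...a.*?href...
which means "after a, match as few characters as possible before a href". Note that laziness does not move the start of the match: the engine still begins at the first <a in the string, so the match can still run across several tags until it reaches Kontakt. Replacing each .* with [^>]* keeps the match inside a single opening tag.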
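To make the difference concrete, here is a small sketch comparing the greedy pattern with one that restricts the wildcards to [^>]*. The markup is a hypothetical, well-formed sample rather than the exact string from the question, and the \s after <a is an extra assumption that every anchor has at least one attribute.

```php
<?php
// Hypothetical, well-formed sample markup (not the exact string from the question).
$html = '<li><a href="/team" >Team</a></li><li><a href="/kontakt" >Kontakt</a></li>';

// Greedy: .* runs across tag boundaries, so the match starts at the first <a.
preg_match('#<a.*href="[^"]*".*>Kontakt</a>#', $html, $greedy);

// Restricted: [^>]* cannot leave the opening tag, so only the Kontakt link matches.
preg_match('#<a\s[^>]*href="[^"]*"[^>]*>Kontakt</a>#', $html, $restricted);

echo $greedy[0], "\n";
echo $restricted[0], "\n";
```

The first echo prints everything from the Team link through the Kontakt link; the second prints only the Kontakt anchor.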
I want to convert into a string the html contained between these comments
<!--content-start-->
desired html
<!--content-end-->
so I use preg_match, right?
preg_match("/<!--content-start-->(.*)<!--content-end-->/i", $rss, $content);
but it won't work. Maybe a problem with the regex?
Thank you.
Perhaps a /s modifier will help. Check the documentation:
s (PCRE_DOTALL)
If this modifier is set, a dot metacharacter in the pattern matches all characters,
including newlines. Without it, newlines are excluded. This modifier is equivalent to
Perl's /s modifier. A negative class such as [^a] always matches a newline character,
independent of the setting of this modifier.
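As a rough illustration of the modifier, the sketch below runs the pattern from the question with and without s on a multi-line string. The sample $rss value is made up for the demonstration, and the lazy (.*?) is an added assumption so that the capture stops at the first end marker.

```php
<?php
// Made-up sample input with newlines between the markers.
$rss = "<!--content-start-->\n<p>desired html</p>\n<!--content-end-->";

// Without /s the dot stops at the first newline, so nothing matches.
$without = preg_match('/<!--content-start-->(.*)<!--content-end-->/i', $rss, $m1);

// With /s the dot also matches newlines; (.*?) stops at the first end marker.
$with = preg_match('/<!--content-start-->(.*?)<!--content-end-->/is', $rss, $m2);

$content = trim($m2[1]);
echo $content, "\n";
```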
Something like this should work. The XPath query looks for a comment containing "content-start" and then returns the sibling nodes following it. We loop through until we find the closing comment.
$html = <<< HTML
<!--content-start-->
<p>Here is my <i>desired html</i></p>
<!-- a comment -->
<div class="foo">Here is more</div>
<!--content-end-->
<p>Not returning this</p>
HTML;
$return = "";
$dom = new DomDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xpath = new DomXpath($dom);
$siblings = $xpath->query("//comment()[.='content-start']/following-sibling::node()");
foreach ($siblings as $node) {
    if ($node instanceof DOMComment && $node->textContent === "content-end") {
        break;
    }
    $return .= $dom->saveHTML($node) . "\n";
}
echo $return;
Output:
<p>Here is my <i>desired html</i></p>
<!-- a comment -->
<div class="foo">Here is more</div>
The following string contains multiple <p> tags. I want to match the contents of each <p> against a pattern, and if it matches, I want to add a CSS class to that specific paragraph.
For example, in the following string only the second paragraph's content matches, so I want to add a class to that paragraph only.
$string = '<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>';
With the following code, I can match all of the string, but I am unable to figure out how to find the specific paragraph.
$rtl_chars_pattern = '/[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]/u';
$return = preg_match($rtl_chars_pattern, $string);
Match the <p> tag together with the RTL character that follows it, and capture that character
Use preg_replace and put the captured character back with $1
https://regex101.com/r/nE5pT1/1
$str = "<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>";
$result = preg_replace("/<p>([\\x{0590}-\\x{05ff}\\x{0600}-\\x{06ff}])/u", "<p class=\"foo\">$1", $str);
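If the RTL text may appear anywhere inside the paragraph rather than directly after the tag, one possible approach is preg_replace_callback: match each paragraph, test its content, and rebuild the tag only when the test succeeds. This is a sketch that assumes attribute-free <p> tags that are not nested; the class name foo is only a placeholder.

```php
<?php
$string = '<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>';
$rtl = '/[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]/u';

// Visit every <p>...</p> pair and decide per paragraph whether to add the class.
$result = preg_replace_callback(
    '~<p>(.*?)</p>~su',
    function (array $m) use ($rtl): string {
        // $m[0] is the whole paragraph, $m[1] its content.
        return preg_match($rtl, $m[1]) === 1
            ? '<p class="foo">' . $m[1] . '</p>'
            : $m[0];
    },
    $string
);

echo $result, "\n";
```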
Use a combination of SimpleXML, XPath and regular expressions (regex on text(), etc. are only supported as of XPath 2.0).
The steps:
Load the DOM first
Get all p tags via an xpath query
If the text / node value matches your regex, apply a css class
This is the actual code:
<?php
$html = "<html><p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p></html>";
$xml = simplexml_load_string($html);
# query the dom for all p tags
$ptags = $xml->xpath("//p");
# your regex
$regex = '~[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]~u';
# alternatively:
# $regex = '~\p{Arabic}~u';
# loop over the tags, if the regex matches, add another attribute
foreach ($ptags as &$p) {
    if (preg_match($regex, (string) $p)) {
        $p->addAttribute('class', 'some cool css class');
    }
}
# just to be sure the tags have been altered
echo $xml->asXML();
?>
See a demo on ideone.com. The code has the advantage that you only analyze the content of the p tag, not the DOM structure in general.
The following situation:
$text = "This is some <span class='classname'>example</span> text i'm writing to
demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";
preg_match_all("|<[^>/]*(classname)(.+)>(.*)</[^>]+>|U", $text, $matches, PREG_PATTERN_ORDER);
I need an array ($matches) where one field contains "<span class='classname'>example</span>" and another contains "example".
But what I get here is one field with "<span class='classname'>example</span>" and one with "classname".
It should also contain the values for the other matches, of course.
How can I get the right values?
You would be better off with a DOM parser, however this question is more to do with how capturing works in Regexes in general.
The reason you are getting classname as a match is because you are capturing it by putting () around it. They are completely unnecessary so you can just remove them. Similarly, you don't need them around .+ since you don't want to capture that.
If you had some group that you had to enclose in () as grouping rather than capturing, start the group with ?: and it won't be captured.
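A minimal sketch of that advice, using the text from the question. The pattern here is a simplified variant and not the asker's exact expression: it drops the U modifier in favour of an explicit lazy (.*?), and keeps classname in a non-capturing group purely to show the ?: syntax.

```php
<?php
$text = "This is some <span class='classname'>example</span> text i'm writing to
demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";

// (?:...) groups without capturing, so the only numbered group is the tag content.
preg_match_all("~<[^>/]*(?:classname)[^>]*>(.*?)</[^>]+>~", $text, $matches, PREG_PATTERN_ORDER);

print_r($matches[0]); // full matches
print_r($matches[1]); // captured content
```

With PREG_PATTERN_ORDER, $matches[0] holds the complete tags and $matches[1] holds the text between them.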
The safe/easy way:
$text = 'blah blah blah';
$dom = new DOMDocument();
$dom->loadHTML($text);
$xp = new DOMXPath($dom);
// @class='classname' matches only an exact class attribute value
$nodes = $xp->query("//span[@class='classname']");
foreach ($nodes as $node) {
    $innertext = $node->nodeValue;
    // outer HTML of the span; for inner HTML see http://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument
    $html = $dom->saveHTML($node);
}
I have code with several lines like this
<p> <inset></p>
Where there may be any number of spaces or tabs (or none) between the opening <p> tag and the rest of the string. I need to replace these, but I can't get it to work.
I thought this would do it, but it doesn't work:
<p>[ \t]+<inset></p>
Try this:
$html = preg_replace('#(<p>)\s+(<inset></p>)#', '$1$2', $html);
If you want true text-trimming for HTML, including everything you can encounter such as entities, comments and child elements, you can make use of a TextRangeTrimmer and TextRange:
$htmlFragment = '<p> <inset></p>';
$dom = new DOMDocument();
$dom->loadHTML($htmlFragment);
$parent = $dom->getElementsByTagName('body')->item(0);
if (!$parent)
{
    throw new Exception('Parent element not found.');
}
$range = new TextRange($parent);
$trimmer = new TextRangeTrimmer($range);
$trimmer->ltrim();
// inner HTML (PHP >= 5.3.6)
foreach($parent->childNodes as $node)
{
    echo $dom->saveHTML($node);
}
Output:
<p><inset></p>
I've both classes in a gist: https://gist.github.com/1894360/ (codepad viper is down).
See as well the related questions / answers:
Wordwrap / Cut Text in HTML string
Ignore html tags in preg_replace
Try to load your HTML string into a DOM tree instead, and then trim all the text values in the tree.
http://php.net/domdocument.loadhtml
http://php.net/trim
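A rough sketch of that idea with DOMDocument follows. It is narrower than trimming every text value: it only left-trims the first text node of each paragraph, which is what the question asks for. The sample fragment is hypothetical and uses plain text in place of the <inset> tag.

```php
<?php
// Hypothetical sample fragment; plain text stands in for the <inset> tag.
$html = "<p> \t  hello <b>world</b></p>";

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);

$result = '';
foreach ($dom->getElementsByTagName('p') as $p) {
    $first = $p->firstChild;
    // Only strip leading whitespace when the paragraph starts with a text node.
    if ($first instanceof DOMText) {
        $first->nodeValue = ltrim($first->nodeValue);
    }
    $result .= $dom->saveHTML($p);
}

echo $result, "\n";
```

Whitespace between "hello" and the <b> element is kept because only the leading side of the first text node is trimmed.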