The following string contains multiple <p> tags. I want to match the contents of each of the <p> with a pattern, and if it matches, I want to add a css class to that specific paragraph.
For example in the following string, only the second paragraph content matches, so i want to add a class to that paragraph only.
$string = '<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>';
With the following code, I can match all of the string, but I am unable to figure out how to find the specific paragraph.
$rtl_chars_pattern = '/[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]/u';
$return = preg_match($rtl_chars_pattern, $string);
Create a capture group on the <p> tag
Use preg_replace
https://regex101.com/r/nE5pT1/1
$str = "<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>";
$result = preg_replace("/(<p>)[\\x{0590}-\\x{05ff}\\x{0600}-\\x{06ff}]/u", "<p class=\"foo\">", $str, 1);
Use a combination of SimpleXML, XPath and regular expressions (regex on text(), etc. are only supported as of XPath 2.0).
The steps:
Load the DOM first
Get all p tags via an xpath query
If the text / node value matches your regex, apply a css class
This is the actual code:
<?php
$html = "<html><p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p></html>";
$xml = simplexml_load_string($html);
# query the dom for all p tags
$ptags = $xml->xpath("//p");
# your regex
$regex = '~[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]~u';
# alternatively:
# $regex = '~\p{Arabic}~u';
# loop over the tags, if the regex matches, add another attribute
foreach ($ptags as &$p) {
if (preg_match($regex, (string) $p))
$p->addAttribute('class', 'some cool css class');
}
# just to be sure the tags have been altered
echo $xml->asXML();
?>
See a demo on ideone.com. The code has the advantage that you only analyze the content of the p tag, not the DOM structure in general.
Related
I have the following regex to find anchor tag that has 'Kontakt' as the anchor text:
#<a.*href="[^"]*".*>Kontakt<\/a>#
Here is the string to find from:
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</li></ul>
So the result should be:
<a href="/kontakt" >Kontakt</a>
But the result I get is:
Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt
And here is my PHP code:
$pattern = '#<a.*href="[^"]*".*>Kontakt<\/a>#';
preg_match_all($pattern, $string, $matches);
You are using preg_match_all() so I assume you are willing to receive multiple qualifying anchor tags. Parsing valid HTML with a legitimate DOM parser will always be more stable and easier to read than the equivalent regex technique. It's just not a good idea to rely on regex for DOM parsing because regex is "DOM-unaware" -- it just matches things that look like HTML entities.
In the XPath query, search for <a> tags (existing at any depth in the document) which have the qualifying string as the whole text.
Code: (Demo)
$html = <<<HTML
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</li></ul>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//a[text() = "Kontakt"]') as $a) {
$result[] = $dom->saveHtml($a);
}
var_export($result);
Output:
array (
0 => 'Kontakt',
)
Is it more concise to use regex? Yes, but it is also less reliable for general use.
You will notice that the DOMDocument also automatically cleans up the unnecessary spacing in your markup.
If you can trust your input will always have <a href in every anchor tag then try:
'#<a href="[^"]*"[^>]*>Kontakt<\/a>#';
// Instead of what you have:
'#<a.*href="[^"]*".*>Kontakt<\/a>/#';
.* is the "wildcard" meta-character . and the "zero or more times" quantifier * together.
.* matches anything any number of times.
Try it https://regex101.com/r/qxnRZv/1
Your regex:
...a.*href...
is greedy, which means: "after a, match as many characters as possible before a href". That causes your regex to return multiple hrefs.
You can use the lazy-mode operator ? :
...a.*?href....
which means "after a, match as few characters as possible before a href". It should work.
I want to add a class to all p tags that contain arabic text in it. For example:
<p>لمبارة وذ</p>
<p>do nothing</p>
<p>خمس دقائق يخ</p>
<p>مراعاة إبقاء 3 لاعبين</p>
should become
<p class="foo">لمبارة وذ</p>
<p>do nothing</p>
<p class="foo">خمس دقائق يخ</p>
<p class="foo">مراعاة إبقاء 3 لاعبين</p>
I am trying to use PHP preg_replace function to match the pattern (arabic) with following expression:
preg_replace("~(\p{Arabic})~u", "<p class=\"foo\">$1", $string, 1);
However it is not working properly. It has two problems:
It only matches the first paragraph.
Adds an empty <p>.
Sandbox Link
It only matches the first paragraph.
This is because you added the last argument, indicating you want only to replace the first occurrence. Leave that argument out.
Adds an empty <p>.
This is in fact the original <p> which you did not match. Just add it to the matching pattern, but keep it outside of the matching group, so it will be left out when you replace with $1.
Here is a corrected version, also on sandbox:
$text = preg_replace("~<p>(\p{Arabic}+)~u", "<p class=\"foo\">$1", $string);
Your first problem is that you weren't telling it to match the <p>, so it didn't.
Your main problem is that spaces aren't Arabic. Simply adding the alternative to match them fixes your problem:
$text = preg_replace("~<p>(\p{Arabic}*|\s*)~u", "<p class=\"foo\">$1", $string);
Using DOMDocument and DOMXPath:
$html = <<<'EOD'
<p>لمبارة وذ</p>
<p>خمس دقائق يخ</p>
<p>مراعاة إبقاء 3 لاعبين</p>
EOD;
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML('<div>'.$html.'</div>', LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
// here you register the php namespace and the preg_match function
// to be able to use it in the XPath query
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPhpFunctions('preg_match');
// select only p nodes with at least one arabic letter
$pNodes = $xpath->query("//p[php:functionString('preg_match', '~\p{Arabic}~u', .) > 0]");
foreach ($pNodes as $pNode) {
$pNode->setAttribute('class', 'foo');
}
$result = '';
foreach ($dom->documentElement->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
echo $result;
How can I exclude href matches for a domain (ex. one.com)?
My current code:
$str = 'This string has one link and another link';
$str = preg_replace('~<a href="(https?://[^"]+)".*?>.*?</a>~', '$1', $str);
echo $str; // This string has http://one.com and http://two.com
Desired result:
This string has one link and http://two.com
Using a regular expression
If you're going to use a regular expression to accomplish this task, you can use a negative lookahead. It basically asserts that the part // in the href attribute is not followed by one.com. It's important to note that a lookaround assertion doesn't consume any characters.
Here's how the regular expression would look like:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
Regex Visualization:
Regex101 demo
Using a DOM parser
Even though this is a pretty simple task, the correct way to achieve this would be using a DOM parser. That way, you wouldn't have to change the regex if the format of your markup changes in future. The regex solution will break if the <a> node contains more attribute values. To fix all those issues, you can use a DOM parser such as PHP's DOMDocument to handle the parsing:
Here's how the solution would look like:
$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the string containing markup
$links = $dom->getElementsByTagName('a');
//Loop through links and replace them with their anchor text
for ($i = $links->length - 1; $i >= 0; $i--) {
$node = $links->item($i);
$text = $node->textContent;
$href = $node->getAttribute('href');
if ($href !== 'http://one.com') {
$newTextNode = $dom->createTextNode($text);
$node->parentNode->replaceChild($newTextNode, $node);
}
}
echo $dom->saveHTML();
Live Demo
This should do it:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
We use a negative lookahead to make sure that one.com does not appear directly after the https?://.
If you also want to check for some subdomains of one.com, use this example:
<a href="(https?://(?!((www|example)\.)?one\.com)[^"]+)".*?>.*?</a>
Here we optionally check for www. or example. before one.com. This will allow a URL like misc.com, though. If you want to remove all subdomains of one.com, use this:
<a href="(https?://(?!([^.]+\.)?one\.com)[^"]+)".*?>.*?</a>
I have an html (sample.html) like this:
<html>
<head>
</head>
<body>
<div id="content">
<!--content-->
<p>some content</p>
<!--content-->
</div>
</body>
</html>
How do i get the content part that is between the 2 html comment '<!--content-->' using php? I want to get that, do some processing and place it back, so i have to get and put! Is it possible?
esafwan - you could use a regex expression to extract the content between the div (of a certain id).
I've done this for image tags before, so the same rules apply. i'll look out the code and update the message in a bit.
[update] try this:
<?php
function get_tag( $attr, $value, $xml ) {
$attr = preg_quote($attr);
$value = preg_quote($value);
$tag_regex = '/<div[^>]*'.$attr.'="'.$value.'">(.*?)<\\/div>/si';
preg_match($tag_regex,
$xml,
$matches);
return $matches[1];
}
$yourentirehtml = file_get_contents("test.html");
$extract = get_tag('id', 'content', $yourentirehtml);
echo $extract;
?>
or more simply:
preg_match("/<div[^>]*id=\"content\">(.*?)<\\/div>/si", $text, $match);
$content = $match[1];
jim
If this is a simple replacement that does not involve parsing of the actual HTML document, you may use a Regular Expression or even just str_replace for this. But generally, it is not a advisable to use Regex for HTML because HTML is not regular and coming up with reliable patterns can quickly become a nightmare.
The right way to parse HTML in PHP is to use a parsing library that actually knows how to make sense of HTML documents. Your best native bet would be DOM but PHP has a number of other native XML extensions you can use and there is also a number of third party libraries like phpQuery, Zend_Dom, QueryPath and FluentDom.
If you use the search function, you will see that this topic has been covered extensively and you should have no problems finding examples that show how to solve your question.
<?php
$content=file_get_contents("sample.html");
$comment=explode("<!--content-->",$content);
$comment=explode("<!--content-->",$comment[1]);
var_dump(strip_tags($comment[0]));
?>
check this ,it will work for you
Problem is with nested divs
I found solution here
<?php // File: MatchAllDivMain.php
// Read html file to be processed into $data variable
$data = file_get_contents('test.html');
// Commented regex to extract contents from <div class="main">contents</div>
// where "contents" may contain nested <div>s.
// Regex uses PCRE's recursive (?1) sub expression syntax to recurs group 1
$pattern_long = '{ # recursive regex to capture contents of "main" DIV
<div\s+class="main"\s*> # match the "main" class DIV opening tag
( # capture "main" DIV contents into $1
(?: # non-cap group for nesting * quantifier
(?: (?!<div[^>]*>|</div>). )++ # possessively match all non-DIV tag chars
| # or
<div[^>]*>(?1)</div> # recursively match nested <div>xyz</div>
)* # loop however deep as necessary
) # end group 1 capture
</div> # match the "main" class DIV closing tag
}six'; // single-line (dot matches all), ignore case and free spacing modes ON
// short version of same regex
$pattern_short = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(? 1)</div>)*)</div>}si';
$matchcount = preg_match_all($pattern_long, $data, $matches);
// $matchcount = preg_match_all($pattern_short, $data, $matches);
echo("<pre>\n");
if ($matchcount > 0) {
echo("$matchcount matches found.\n");
// print_r($matches);
for($i = 0; $i < $matchcount; $i++) {
echo("\nMatch #" . ($i + 1) . ":\n");
echo($matches[1][$i]); // print 1st capture group for match number i
}
} else {
echo('No matches');
}
echo("\n</pre>");
?>
Have a look here for a code example that means you can load a HTML document into SimpleXML http://blog.charlvn.com/2009/03/html-in-php-simplexml.html
You can then treat it as a normal SimpleXML object.
EDIT: This will only work if you want the content in a tag (e.g. between <div> and </div>)
Ok I have to parse out a SOAP request and in the request some of the values are passed with (or inside) a Anchor tag. Looking for a RegEx (or alt method) to strip the tag and just return the value.
// But item needs to be a RegEx of some sort, it's a field right now
if($sObject->list == 'item') {
// Split on > this should be the end of the right side of the anchor tag
$pieces = explode(">", $sObject->fields->$field);
// Split on < this should be the closing anchor tag
$piece = explode("<", $pieces[1]);
$fields_string .= $piece[0] . "\n";
}
item is a field name but I would like to make this a RegEx to check for the Anchor tag instead of a specific field.
PHP has a strip_tags() function.
Alternatively you can use filter_var() with FILTER_SANITIZE_STRING.
Whatever you do don't parse HTML/XML with regular expressions. It's really error-prone and flaky. PHP has at least 3 different parsers as standard (SimpleXML, DOMDocument and XMLReader spring to mind).
I agree with cletus, using RegEx on HTML is bad practice because of how loose HTML is as a language (and I moan about PHP being too loose...). There are just so many ways you can variate a tag that unless you know that the document is standards-compliant / strict, it is sometimes just impossible to do. However, because I like a challenge that distracts me from work, here's how you might do it in RegEx!
I'll split this up into sections, no point if all you see is a string and say, "Meh... It'll do..."! First we have the main RegEx for an anchor tag:
'#<a></a>#'
Then we add in the text that could be between the tags.
We want to group this is parenthesis, so we can extract the string, and the question mark makes the asterix wildcard "un-greedy", meaning that the first </a> that it comes accross will be the one it uses to end the RegEx.
'#<a>(.*?)</a>#'
Next we add in the RegEx for href="". We match the href=" as plain text, then an any-length string that does not contain a quotation mark, then the ending quotation mark.
'#<a href\="([^"]*)">(.*?)</a>#'
Now we just need to say that the tag is allowed other attributes. According to the specification, an attribute can contain the following characters: [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*.
Allow an attribute multiple times, and with a value, we get: ( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*.
The resulting RegEx (PCRE) is as following:
'#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#'
Now, in PHP, use the preg_match_all() function to grab all occurances in the string.
$regex = '#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#';
preg_match_all($regex, $str_containing_anchors, $result);
foreach($result as $link)
{
$href = $link[2];
$text = $link[4];
}
use simplexml and xpath to retrieve the desired nodes
If you don't have some kind of request<->class mapping you can extract the information with the DOM extension. The property textConent contains all the text of the context node and its descendants.
$sr = '<?xml version="1.0"?>
<SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
<SOAP:Body>
<foo:bar xmlns:foo="urn:yaddayadda">
<fragment>
Mary had a
little lamb
</fragment>
</foo:bar>
</SOAP:Body>
</SOAP:Envelope>';
$doc = new DOMDocument;
$doc->loadxml($sr);
$xpath = new DOMXPath($doc);
$ns = $xpath->query('//fragment');
if ( 0 < $ns->length ) {
echo $ns->item(0)->nodeValue;
}
prints
Mary had a
little lamb
If you want to strip or extract properties from only specific tag, you should try DOMDocument.
Something like this:
$TagWhiteList = array(
// Example of WhiteList
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getTextFromNode($Node, $Text = "") {
// No tag, so it is a text
if ($Node->tagName == null)
return $Text.$Node->textContent;
// You may select a tag here
// Like:
// if (in_array($TextName, $TagWhiteList))
// DoSomthingWithIt($Text,$Node);
// Recursive to child
$Node = $Node->firstChild;
if ($Node != null)
$Text = getTextFromNode($Node, $Text);
// Recursive to sibling
while($Node->nextSibling != null) {
$Text = getTextFromNode($Node->nextSibling, $Text);
$Node = $Node->nextSibling;
}
return $Text;
}
function getTextFromDocument($DOMDoc) {
return getTextFromNode($DOMDoc->documentElement);
}
To use:
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
$Text = getTextFromDocument($Doc);
echo "Text from HTML: ".$Text."\n";
The above function is how to strip tags. But you can modify it a bit to manipulate the element. For example, if the tag is 'a' of archor, you can extract its target and display it instead of the text inside.
Hope this help.