I've been using this regular expression (probably found on Stack Overflow a few years back) to convert mailto links in PHP:
preg_match_all("/<a([ ]+)href=([\"']*)mailto:(([[:alnum:]._\-]+)#([[:alnum:]._\-]+\.[[:alnum:]._\-]+))([\"']*)([[:space:][:alnum:]=\"_]*)>([^<|#]*)(#?)([^<]*)<\/a>/i",$content,$matches);
I pass it $content = '<a href="mailto:name#domain.com">somename#domain.com</a>'
It returns these matched pieces:
0 <a href="mailto:name#domain.com">somename#domain.com</a>
1
2 "
3 name#domain.com
4 name
5 domain.com
6 "
7
8 somename
9 #
10 domain.com
Example usage: ucwords($matches[8][0])
My problem is that some links contain nested tags. The expression stops at the next "<" when capturing pieces 8, 9 and 10, so nested tags throw it off.
Example:
<a href="mailto:name#domain.com"><span><b>somename#domain.com</b></span></a>
I need to ignore the nested tags and just extract the "some name" piece:
match part 8 = <span><b>
match part 9 = somename
match part 10 = #
match part 11 = domain.com
match part 12 = </b></span>
I've tried to get it to work by tweaking ([^<|#]*)(#?)([^<]*) but I can't figure out the right syntax to match or ignore the nested tags.
You could simply match everything between the opening and closing <a> tags with .*?: replace ([^<|#]*)(#?)([^<]*) with (.*?) and the group will contain everything inside the <a> tag, nested tags included. You can then remove the nested tags with strip_tags() or another regex.
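As a rough sketch of that suggestion, the snippet below captures the link body with a lazy (.*?) and then strips the nested tags. The sample markup and the simplified pattern are assumptions made for illustration (the question's full expression is longer), and # stands in for the at sign as in the question.

```php
<?php
// Assumed sample input modelled on the question; '#' stands in for the at sign.
$content = '<a href="mailto:name#domain.com"><span><b>somename#domain.com</b></span></a>';

// Simplified pattern (an assumption, not the question's full expression):
// group 1 is the address, group 2 is everything between <a ...> and </a>.
$pattern = '/<a\s+href=["\']mailto:([^"\']+)["\'][^>]*>(.*?)<\/a>/is';

preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$labels = [];
foreach ($matches as $m) {
    // strip_tags() drops <span>, <b> and any other nested markup
    $labels[] = strip_tags($m[2]);
}

// Split the first label at '#' to get the name part
list($name, $domain) = explode('#', $labels[0], 2);
echo ucwords($name), "\n";
```

The lazy quantifier stops at the first </a>, so this only holds as long as links are not nested inside each other, which valid HTML does not allow anyway.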
However, regular expressions are not well suited to nested HTML tags. You are better off using something like DOMDocument, which is made for parsing HTML. Something like:
<?php
$DOM = new DOMDocument();
// Sample input: the nested-tag link from the question (anchor markup assumed)
$DOM->loadXML('<a href="mailto:name#domain.com"><span><b>somename#domain.com</b></span></a>');
$list = $DOM->getElementsByTagName('a');
foreach ($list as $link) {
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
    // only match if href starts with mailto:
    if (stripos($href, 'mailto:') === 0) {
        var_dump($href);
        var_dump($text);
    }
}
http://codepad.viper-7.com/SqDKgr
To only get access to the part within the link, try
[^>]*>([^>]+)#.*
What you need should be in the first group of the result.
You can try this pattern:
$pattern = '~\bhref\s*+=\s*+(["\'])mailto:\K(?<mail>(?<name>[^#]++)#(?<domain>.*?))\1[^>]*+>(?:\s*+</?(?!a\b)[^>]*+>\s*+)*+(?<content>[^<]++)~i';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
echo '<pre>' . print_r($matches, true) . '</pre>';
and you can access your data like that:
echo $matches[0]['name'];
Try this regex
/^(<.*>)(.*)(#)/
/^/- Start of string
/(<.*>)/ - First match group: starts with <, then anything (greedily) up to the last >
/(.*)(#)/ - Second group matches anything up to the #, which the third group captures
I have the following regex to find anchor tag that has 'Kontakt' as the anchor text:
#<a.*href="[^"]*".*>Kontakt<\/a>#
Here is the string to find from:
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</a></li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</a></li></ul>
So the result should be:
<a href="/kontakt" >Kontakt</a>
But the result I get is:
Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt
And here is my PHP code:
$pattern = '#<a.*href="[^"]*".*>Kontakt<\/a>#';
preg_match_all($pattern, $string, $matches);
You are using preg_match_all(), so I assume you are willing to receive multiple qualifying anchor tags. Parsing valid HTML with a legitimate DOM parser will always be more stable and easier to read than the equivalent regex technique. It's not a good idea to rely on regex for DOM parsing because regex is "DOM-unaware": it just matches text that looks like HTML markup.
In the XPath query, search for <a> tags (existing at any depth in the document) which have the qualifying string as the whole text.
Code:
$html = <<<HTML
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</a></li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</a></li></ul>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//a[text() = "Kontakt"]') as $a) {
    $result[] = $dom->saveHTML($a);
}
var_export($result);
Output:
array (
  0 => '<a href="/kontakt">Kontakt</a>',
)
Is it more concise to use regex? Yes, but it is also less reliable for general use.
You will notice that the DOMDocument also automatically cleans up the unnecessary spacing in your markup.
If you can trust your input will always have <a href in every anchor tag then try:
'#<a href="[^"]*"[^>]*>Kontakt<\/a>#';
// Instead of what you have:
'#<a.*href="[^"]*".*>Kontakt<\/a>#';
.* is the "wildcard" meta-character . and the "zero or more times" quantifier * together.
.* matches anything any number of times.
Try it https://regex101.com/r/qxnRZv/1
Your regex:
...a.*href...
is greedy, which means: "after a, match as many characters as possible before a href". That causes your regex to return multiple hrefs.
You can use the lazy-mode operator ? :
...a.*?href....
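which means "after a, match as few characters as possible before href". The second .* (before >Kontakt) is greedy as well and needs the same change. Even then, the match can still begin at an earlier <a, so restricting both parts to [^>]* is more reliable.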
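A quick way to check how greedy, lazy and bounded quantifiers behave on this kind of input is to run all three side by side. The shortened, well-formed markup below is an assumption made for illustration; the third pattern uses [^>]* so that the match cannot leave the opening tag.

```php
<?php
// Shortened, well-formed version of the menu markup (an assumption for illustration).
$html = '<li><a href="/team" >Team</a></li><li><a href="/kontakt" >Kontakt</a></li>';

// Greedy: .* runs as far as it can, so the match spans both links.
preg_match('#<a.*href="[^"]*".*>Kontakt</a>#', $html, $greedy);

// Lazy: the match still starts at the first <a and stretches to "Kontakt".
preg_match('#<a.*?href="[^"]*".*?>Kontakt</a>#', $html, $lazy);

// Bounded: [^>]* cannot leave the opening tag, so only the last link matches.
preg_match('#<a\s[^>]*href="[^"]*"[^>]*>Kontakt</a>#', $html, $bounded);

var_dump($greedy[0] === $lazy[0]);
echo $bounded[0], "\n";
```

On this input the lazy variant returns the same over-long match as the greedy one, because a regex match always starts at the leftmost possible position.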
I have the following string stored in a database table that contains HTML I need to strip out before rendering on a web page (This is old content I had no control over).
<p>I am <30 years old and weight <12st</p>
When I have used strip_tags it is only showing I am.
I understand why strip_tags is doing that, so I need to replace the 2 instances of < with &lt;
I have found a regex that converts the first instance but not the 2nd, but I can't work out how to amend this to replace all instances.
/<([^>]*)(<|$)/
which results in I am currently <30 years old and less than
I have a demo here https://eval.in/1117956
It's a bad idea to try to parse html content with string functions, including regex functions (there're many topics that explain that on SO, search them). html is too complicated to do that.
The problem is that you have poorly formatted html on which you have no control.
There are two possible attitudes:
There's nothing to do: the data are corrupted, so the information is lost once and for all and you can't retrieve something that has disappeared. This is a perfectly acceptable point of view.
Maybe you can find another source for the same data somewhere, or you can choose to print the poorly formatted HTML as is.
You can try to repair it. In this case you have to ensure that all the document's problems are limited and can be solved (at least by hand).
In place of a direct string approach, you can use the PHP libxml implementation via DOMDocument. Even if the libxml parser will not give better results than strip_tags, it provides errors you can use to identify the kind of error and to find the problematic positions in the html string.
With your string, the libxml parser returns a recoverable error XML_ERR_NAME_REQUIRED with the code 68 on each problematic opening angle bracket. Errors can be seen using libxml_get_errors().
Example with your string:
$s = '<p>I am <30 years old and weight <12st</p>';
$libxmlErrorState = libxml_use_internal_errors(true);
// Return the position of the last libxml error with the given code, or false if there is none
function getLastErrorPos($code) {
    $errors = array_filter(libxml_get_errors(), function ($e) use ($code) {
        return $e->code === $code;
    });
    if ( !$errors )
        return false;
    $lastError = array_pop($errors);
    return ['line' => $lastError->line - 1, 'column' => $lastError->column - 2 ];
}
define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name
$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
// Escape the last reported bracket, then re-parse until no such error remains
while ( false !== $position = getLastErrorPos(XML_ERR_NAME_REQUIRED) ) {
    libxml_clear_errors();
    $pattern = vsprintf($patternTemplate, $position);
    $s = preg_replace($pattern, '&lt;', $s, 1);
    $dom = new DOMDocument;
    $dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
}
echo $dom->saveHTML();
libxml_clear_errors();
libxml_use_internal_errors($libxmlErrorState);
$patternTemplate is a format string (see sprintf in the PHP manual) in which the %d placeholders stand, respectively, for the number of lines before the target and the offset from the start of the line (0 and 8 here).
Pattern details: The goal of the pattern is to reach the angle bracket position from the start of the string.
~ # my favorite pattern delimiter
(?:
.* # all character until the end of the line
\R # the newline sequence
){0} # reach the desired line
.{8} # reach the desired column
\K # remove all on the left from the match result
< # the match result is only this character
~A # anchor the pattern at the start of the string
Another related question in which I used a similar technique: parse invalid XML manually
try this
$string = '<p>I am <30 years old and weight <12st</p>';
$html = preg_replace('/^\s*<[^>]+>\s*|\s*<\/[^>]+>\s*\z/', '', $string);// remove html tags
$final = preg_replace('/[^A-Za-z0-9 !##$%^&*().]/u', '', $html); //remove special character
A simple use of str_replace() would do it.
Replace the <p> and </p> with [p] and [/p]
Replace the < with &lt;
Put the p tags back, i.e. replace the [p] and [/p] with <p> and </p>
Code
<?php
$description = "<p>I am <30 years old and weight <12st</p>";
$d = str_replace(['[p]','[/p]'], ['<p>','</p>'],
    str_replace('<', '&lt;',
        str_replace(['<p>','</p>'], ['[p]','[/p]'],
            $description)));
echo $d;
RESULT
<p>I am &lt;30 years old and weight &lt;12st</p>
My guess is that here we might want to design a good right boundary to capture < in non-tags, maybe a simple expression similar to:
<(\s*[+-]?[0-9])
might work, since we would normally expect a number or a sign right after such a <. The [+-]?[0-9] part would need to change if other characters can follow the <.
Test
$re = '/<(\s*[+-]?[0-9])/m';
$str = '<p>I am <30 years old and weight <12st I am < 30 years old and weight < 12st I am <30 years old and weight < -12st I am < +30 years old and weight < 12st</p>';
$subst = '&lt;$1';
$result = preg_replace($re, $subst, $str);
echo $result;
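To tie this back to the original goal of stripping the markup, here is a minimal sketch that escapes the stray brackets first and only then calls strip_tags(). It uses a lookahead variant of the expression above, and it rests on the assumption that a real tag never starts with a digit or a sign.

```php
<?php
$raw = '<p>I am <30 years old and weight <12st</p>';

// Escape only a "<" that is followed by an optional sign and a digit
// (assumption: real tags never start with a digit or a sign).
$escaped = preg_replace('/<(?=\s*[+-]?[0-9])/', '&lt;', $raw);

// With the stray brackets escaped, strip_tags() removes only the real tags.
$text = strip_tags($escaped);

echo $escaped, "\n";
echo $text, "\n";
```

The text keeps the &lt; entities, so it can be printed into a page as is and the browser will show the < characters.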
The following string contains multiple <p> tags. I want to match the contents of each of the <p> with a pattern, and if it matches, I want to add a css class to that specific paragraph.
For example, in the following string only the second paragraph's content matches, so I want to add a class to that paragraph only.
$string = '<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>';
With the following code, I can match all of the string, but I am unable to figure out how to find the specific paragraph.
$rtl_chars_pattern = '/[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]/u';
$return = preg_match($rtl_chars_pattern, $string);
Create a capture group on the <p> tag
Use preg_replace
https://regex101.com/r/nE5pT1/1
$str = "<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>";
// Group 2 holds the first RTL character so the replacement can put it back
$result = preg_replace("/(<p>)([\\x{0590}-\\x{05ff}\\x{0600}-\\x{06ff}])/u", '<p class="foo">$2', $str, 1);
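If the matching characters can appear anywhere inside a paragraph, one option is to test each paragraph separately with preg_replace_callback(). This is only a sketch: it assumes plain <p> tags without attributes or nesting, and the class name rtl is a placeholder.

```php
<?php
$string = '<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>';
$rtl_chars_pattern = '/[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]/u';

// Visit every <p>...</p> pair (assumes plain <p> tags without attributes or nesting).
$result = preg_replace_callback(
    '~<p>(.*?)</p>~su',
    function (array $m) use ($rtl_chars_pattern) {
        // "rtl" is a placeholder class name
        return preg_match($rtl_chars_pattern, $m[1]) === 1
            ? '<p class="rtl">' . $m[1] . '</p>'
            : $m[0];
    },
    $string
);

echo $result, "\n";
```

Paragraphs that do not match are returned unchanged, so the rest of the string stays as it was.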
Use a combination of SimpleXML, XPath and regular expressions (regex on text(), etc. are only supported as of XPath 2.0).
The steps:
Load the DOM first
Get all p tags via an xpath query
If the text / node value matches your regex, apply a css class
This is the actual code:
<?php
$html = "<html><p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p></html>";
$xml = simplexml_load_string($html);
# query the dom for all p tags
$ptags = $xml->xpath("//p");
# your regex
$regex = '~[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]~u';
# alternatively:
# $regex = '~\p{Arabic}~u';
# loop over the tags, if the regex matches, add another attribute
foreach ($ptags as $p) {
    if (preg_match($regex, (string) $p)) {
        $p->addAttribute('class', 'some cool css class');
    }
}
# just to be sure the tags have been altered
echo $xml->asXML();
?>
See a demo on ideone.com. The code has the advantage that you only analyze the content of the p tag, not the DOM structure in general.
How can I exclude href matches for a domain (ex. one.com)?
My current code:
$str = 'This string has <a href="http://one.com">one link</a> and <a href="http://two.com">another link</a>';
$str = preg_replace('~<a href="(https?://[^"]+)".*?>.*?</a>~', '$1', $str);
echo $str; // This string has http://one.com and http://two.com
Desired result:
This string has <a href="http://one.com">one link</a> and http://two.com
Using a regular expression
If you're going to use a regular expression to accomplish this task, you can use a negative lookahead. It basically asserts that the part // in the href attribute is not followed by one.com. It's important to note that a lookaround assertion doesn't consume any characters.
Here's what the regular expression would look like:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
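To see the lookahead in action, the following sketch runs the expression through preg_replace(). The input string is an assumption: it is reconstructed from the output shown in the question.

```php
<?php
// Input reconstructed from the question's output (an assumption).
$str = 'This string has <a href="http://one.com">one link</a> '
     . 'and <a href="http://two.com">another link</a>';

// (?!one\.com) rejects the match when the host starts with one.com,
// so the first link is left untouched.
$pattern = '~<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>~';

$result = preg_replace($pattern, '$1', $str);
echo $result, "\n";
```

Note that the lookahead only checks the characters right after the scheme, so it would also skip a host such as one.community.org.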
Using a DOM parser
Even though this is a pretty simple task, the correct way to achieve this would be using a DOM parser. That way, you wouldn't have to change the regex if the format of your markup changes in future. The regex solution will break if the <a> node contains more attribute values. To fix all those issues, you can use a DOM parser such as PHP's DOMDocument to handle the parsing:
Here's what the solution would look like:
$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the string containing markup
$links = $dom->getElementsByTagName('a');
// Loop through the links backwards (the list is live) and replace them with their URL
for ($i = $links->length - 1; $i >= 0; $i--) {
    $node = $links->item($i);
    $href = $node->getAttribute('href');
    if ($href !== 'http://one.com') {
        $newTextNode = $dom->createTextNode($href);
        $node->parentNode->replaceChild($newTextNode, $node);
    }
}
echo $dom->saveHTML();
This should do it:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
We use a negative lookahead to make sure that one.com does not appear directly after the https?://.
If you also want to check for some subdomains of one.com, use this example:
<a href="(https?://(?!((www|example)\.)?one\.com)[^"]+)".*?>.*?</a>
Here we optionally check for www. or example. before one.com. This will still allow a URL like misc.one.com, though. If you want to remove all subdomains of one.com, use this:
<a href="(https?://(?!([^.]+\.)?one\.com)[^"]+)".*?>.*?</a>
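A small check of that last lookahead against a few hosts might look like this. The URL list is made up for illustration, and the pattern is anchored with ^ and $ only so that it can be applied to a bare URL instead of a whole link.

```php
<?php
// Lookahead from the last pattern above, anchored so it can test a bare URL.
$pattern = '~^https?://(?!([^.]+\.)?one\.com)[^"]+$~';

// Made-up sample URLs
$urls = [
    'http://one.com',
    'http://www.one.com',
    'http://misc.one.com',
    'http://two.com',
];

$kept = [];
foreach ($urls as $url) {
    // A URL is kept when the negative lookahead does not reject it
    if (preg_match($pattern, $url) === 1) {
        $kept[] = $url;
    }
}

print_r($kept);
```

Since ([^.]+\.)? matches at most one label, a deeper host such as a.b.one.com would still pass this check.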
I have an html (sample.html) like this:
<html>
<head>
</head>
<body>
<div id="content">
<!--content-->
<p>some content</p>
<!--content-->
</div>
</body>
</html>
How do I get the content between the two HTML comments '<!--content-->' using PHP? I want to get it, do some processing and put it back, so I need to both read and write. Is it possible?
esafwan - you could use a regular expression to extract the content inside the div (with a certain id).
I've done this for image tags before, so the same rules apply. I'll dig out the code and update the message in a bit.
[update] try this:
<?php
function get_tag( $attr, $value, $xml ) {
    // escape both values for use inside a /-delimited pattern
    $attr = preg_quote($attr, '/');
    $value = preg_quote($value, '/');
    $tag_regex = '/<div[^>]*'.$attr.'="'.$value.'">(.*?)<\\/div>/si';
    // return null when no matching div is found
    if (!preg_match($tag_regex, $xml, $matches)) {
        return null;
    }
    return $matches[1];
}
$yourentirehtml = file_get_contents("test.html");
$extract = get_tag('id', 'content', $yourentirehtml);
echo $extract;
?>
or more simply:
preg_match("/<div[^>]*id=\"content\">(.*?)<\\/div>/si", $text, $match);
$content = $match[1];
jim
If this is a simple replacement that does not involve parsing the actual HTML document, you may use a regular expression or even just str_replace for this. But generally, it is not advisable to use regex for HTML because HTML is not regular and coming up with reliable patterns can quickly become a nightmare.
The right way to parse HTML in PHP is to use a parsing library that actually knows how to make sense of HTML documents. Your best native bet would be DOM but PHP has a number of other native XML extensions you can use and there is also a number of third party libraries like phpQuery, Zend_Dom, QueryPath and FluentDom.
If you use the search function, you will see that this topic has been covered extensively and you should have no problems finding examples that show how to solve your question.
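For orientation, a minimal DOM-based sketch for the document in the question could look like the following. The markup is inlined so the example is self-contained, and strtoupper() is only a placeholder for whatever processing is needed.

```php
<?php
// Sample document from the question, inlined so the sketch is self-contained.
$html = '<html><head></head><body><div id="content"><!--content-->'
      . '<p>some content</p><!--content--></div></body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="content"]')->item(0);

// Read: text of the first <p> inside the div
$paragraph = $xpath->query('.//p', $div)->item(0);
$original = $paragraph->textContent;

// Write: replace the text (strtoupper() is a placeholder for real processing)
$paragraph->nodeValue = strtoupper($original);

echo $dom->saveHTML($div), "\n";
```

Passing a node to saveHTML() serializes just that element, which avoids printing the whole document.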
<?php
$content=file_get_contents("sample.html");
$comment=explode("<!--content-->",$content);
$comment=explode("<!--content-->",$comment[1]);
var_dump(strip_tags($comment[0]));
?>
Check this; it should work for you.
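The explode() approach above only reads the content. If the processed content also has to go back between the two markers, a callback replacement is one way to do both in a single step. This is a sketch under assumptions: the markup is inlined and the str_replace() call stands in for the real processing.

```php
<?php
$marker = '<!--content-->';
// Inlined sample markup (an assumption; the question reads it from sample.html)
$content = "<div id=\"content\">\n$marker\n<p>some content</p>\n$marker\n</div>";

// Quote the marker so its characters are treated literally in the pattern
$quoted = preg_quote($marker, '~');
$pattern = '~(' . $quoted . ')(.*?)(' . $quoted . ')~s';

$updated = preg_replace_callback($pattern, function (array $m) {
    // Placeholder processing step: change the text between the markers
    $processed = str_replace('some content', 'processed content', $m[2]);
    // Put the markers back around the processed content
    return $m[1] . $processed . $m[3];
}, $content);

echo $updated, "\n";
```

Both markers are kept in place, so the same replacement can be run again later.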
The problem is with nested divs.
I found a solution here:
<?php // File: MatchAllDivMain.php
// Read html file to be processed into $data variable
$data = file_get_contents('test.html');
// Commented regex to extract contents from <div class="main">contents</div>
// where "contents" may contain nested <div>s.
// Regex uses PCRE's recursive (?1) sub expression syntax to recurs group 1
$pattern_long = '{ # recursive regex to capture contents of "main" DIV
<div\s+class="main"\s*> # match the "main" class DIV opening tag
( # capture "main" DIV contents into $1
(?: # non-cap group for nesting * quantifier
(?: (?!<div[^>]*>|</div>). )++ # possessively match all non-DIV tag chars
| # or
<div[^>]*>(?1)</div> # recursively match nested <div>xyz</div>
)* # loop however deep as necessary
) # end group 1 capture
</div> # match the "main" class DIV closing tag
}six'; // single-line (dot matches all), ignore case and free spacing modes ON
// short version of same regex
$pattern_short = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';
$matchcount = preg_match_all($pattern_long, $data, $matches);
// $matchcount = preg_match_all($pattern_short, $data, $matches);
echo("<pre>\n");
if ($matchcount > 0) {
    echo("$matchcount matches found.\n");
    // print_r($matches);
    for ($i = 0; $i < $matchcount; $i++) {
        echo("\nMatch #" . ($i + 1) . ":\n");
        echo($matches[1][$i]); // print 1st capture group for match number i
    }
} else {
    echo('No matches');
}
echo("\n</pre>");
?>
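To try the recursion without a test.html file, here is a small self-contained check. The input string is made up; the pattern is the short form of the expression above, with the recursion written as (?1).

```php
<?php
// Made-up input: a "main" div with two levels of nested divs, followed by another div.
$data = '<div class="main">a<div>b<div>c</div></div>d</div><div class="other">x</div>';

// Short form of the recursive expression; (?1) re-enters capture group 1.
$pattern = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';

if (preg_match($pattern, $data, $m) === 1) {
    echo $m[1], "\n";   // contents of the "main" div, nested divs included
}
```

The match ends at the closing tag that balances the opening "main" div, so the trailing "other" div is not included.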
Have a look at this code example, which shows how to load an HTML document into SimpleXML: http://blog.charlvn.com/2009/03/html-in-php-simplexml.html
You can then treat it as a normal SimpleXML object.
EDIT: This will only work if you want the content in a tag (e.g. between <div> and </div>)