regex exclude nested html tag - php

I have a piece of text:
<strong>blalblalba</strong>blasldasdsadasdasd<strong> 3.5m Euros<br>
<span class="style6">SOLD</span></strong>
and I want to remove the <strong>...</strong> element that contains $, euros or Euros.
So far I have:
preg_replace('#<strong>.*?(^<strong>).*?(\$|euros|Euros|EUROS).*?</strong>#is', '', $result);
but it is not working. I also tried a negative lookahead (?!), but that did not work either.
Any help? Thanks

Assuming you expect two <strong> tags before your Euros, I think this may be what you want: preg_replace('#^<strong>.*?<strong>.*?(\$|euros|Euros|EUROS).*?</strong>#is', '', $result);

You can try this. It needs the 'Dot-All' modifier (s); otherwise substitute [\S\s] for the dot.
# <strong>(?:(?!\1)(?:\$|euros|Euros|EUROS)()|(?!<strong>).)+</strong>\1
<strong>
(?:
    (?! \1 )
    (?: \$ | euros | Euros | EUROS )
    ( )
  |
    (?! <strong> )
    .
)+
</strong>
\1
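For illustration, here is a minimal sketch of how the pattern above might be applied to the text from the question. The variable names are only for this example, and the i flag is assumed so that a single euros alternative covers every casing.

```php
<?php
// Subject taken from the question.
$result = '<strong>blalblalba</strong>blasldasdsadasdasd<strong> 3.5m Euros<br>' . "\n"
        . '<span class="style6">SOLD</span></strong>';

// Compact form of the pattern above.
// The empty group () is set only once a currency word has been seen; the
// trailing \1 fails while that group is unset, so a <strong> block without
// a currency word is never matched. The s flag lets the dot cross newlines.
$pattern = '#<strong>(?:(?!\1)(?:\$|euros)()|(?!<strong>).)+</strong>\1#is';

$cleaned = preg_replace($pattern, '', $result);

echo $cleaned, "\n"; // <strong>blalblalba</strong>blasldasdsadasdasd
```

Only the second <strong> block is removed, because it is the only one that contains a currency word before its closing tag.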

Related

Match '$' Symbol in regex

I have the following expression:
MSRP | <span style='text-decoration: line-through;'>$74,660</span><br /> Buy | $67,092
I need
MSRP $74,660 $67,092
I can't seem to find the regex to include the '$' symbol in the match group. This is what I am currently doing:
MSRP | <[^>]+>\$([^>]+)<[^>]+> *<[^>]+> *Buy | $([^>]+)\/i
What is wrong with this expression and why is it not including the '$' symbol?
You need to escape the $ and the |, and also put the \$ inside the parentheses so that it is matched as part of the group:
(MSRP) \| <[^>]+>(\$[^>]+)<[^>]+> *<[^>]+> *Buy \| (\$[^>]+)
Putting the \$ inside the parentheses is what makes the $ sign appear together with the numbers in your results.
Here is a working example with PHP:
$str = "MSRP | <span style='text-decoration: line-through;'>$74,660</span><br /> Buy | $67,092";
preg_match('/MSRP \| <[^>]+>(\$[^>]+)<[^>]+> *<[^>]+> *Buy \| (\$[^>]+)/i', $str, $matches);
I got the results like this:
1 => '$74,660', 2 => '$67,092'

[php, maybe regex] How to remove everything from a string except [/] + [sequence of 4 or more digits] (/1111)

I have a large string stored in a variable (the source code of big pages). I want everything to be removed except the
values that are inside href="HERE",
like this: href="/45214"
It is important that only values in this format are preserved: a single / followed by digits, in a sequence of 4 or more digits.
expected output:
/45214
I think it's something like this:
'/href=\"(\/)[0-9]/'
$source = '</li>
<li >
<div class="widget-post-holder">
<a href="/45214" title="care with your skin against
pollution" class="post-thumb" >
<span class="post-cont">
health </span>
<div class="librLoaderLine"></div>
<img title="care with your skin against pollution"
id="0045214"
class="te lazy js-postPreview"
data-src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
data-libr="https://healthandc.com/media/posts/201105/23/45214/libr_225k_45214.webm"
alt="care with your skin against pollution" />
<span class="hd-post" onclick="window.location.href = '/45214'"></span>
</a>
</li>
<li >
<div class="widget-post-holder">
<a href="/7487423" title="natural hair straightening" class="post-thumb" >
<span class="post-cont">health</span>
<div class="librLoaderLine"></div>
<img title="natural hair straightening"
id="0045214"
class="te lazy js-postPreview"
data-src="https://wemedic.com/media/posts/201105/23/7487423/original/14.jpg"
src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
data-libr="https://healthandc.com/media/posts/201105/23/7487423/libr_225k_7487423.webm"
alt="care with your skin against pollution" />
<span class="hd-post" onclick="window.location.href = '/7487423'"></span>
</a>';
preg_match_all("/href=\"(\/)[0-9]/", $source, $results);
var_export(end($results));
expected output:
/45214
/7487423
Thanks
You can use DOMDocument to extract all href attribute values, and then check each with a simple '~^/\d{4,}$~' regex that matches
^ - start of string
/ - a slash
\d{4,} - 4+ digits
$ - end of string.
PHP code:
$html = "YOUR_HTML_CODE";
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$results = [];
foreach ($xpath->query('//*/@href') as $val) {
    if (preg_match('~^/\d{4,}$~', $val->value)) {
        array_push($results, $val->value);
    }
}
print_r($results);
Output:
Array
(
[0] => /45214
[1] => /7487423
)
Although the OP asks for a PHP solution, since this involves HTML you could also use JavaScript and a regex, run in a page that contains the markup from the question:
var d = document;
d.g = d.getElementsByTagName;
var aTags = d.g("a");
var matches = [];
var re = /\/\d{4,}/;
for (var i=0, max = aTags.length; i <= max - 1; i++) {
matches[i] = re.exec(aTags[i].href);
}
d.body.innerHTML="";
console.log(matches);
Use the regex href="(\/[0-9]{4,})". The {4,} requires 4 or more consecutive digits, and the parentheses capture the slash together with the digits.
See example https://regex101.com/r/BlKv9L/1/
$re = '/href="(\/[0-9]{4,})"/m';
$str = ' <a href="/45214" title="care with your skin against
<a href="/452143232" title="care with your skin against
<a href="/214" title="care with your skin against
<a href="/543543545214" title="care with your skin against
<a href="/45215434" title="care with your skin against
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
can be checked here
<(([^<>"]+"){2})*[^<>"]*href="\K[^"]+
scraper series:
You can use preg_match_all() efficiently with a regex that is safe for tag parsing.
A nice feature of this approach is that it won't error out on malformed HTML,
and it won't look for matches inside invisible content (such as comments, etc.).
PHP running code
http://sandbox.onlinephpfunctions.com/code/a182a6d57e887d44f9040166cf57fbb3486bb183
<?php
$string = ' HTML ';
preg_match_all
(
'~(?si)(?:<[\w:]+(?=(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)href\s*=\s*(?:([\'"])\s*(/\d{4,})\s*\1))\s+(?:".*?"|\'.*?\'|[^>]*?)+>\K|<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>".*?"|\'.*?\'|(?:(?!/>)[^>])?)+)?\s*>).*?</\3\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:".*?"|\'.*?\'|[^>]?)+\s*/?)|\?.*?\?|(?:!(?:(?:DOCTYPE.*?)|(?:\[CDATA\[.*?\]\])|(?:--.*?--)|(?:ATTLIST.*?)|(?:ENTITY.*?)|(?:ELEMENT.*?))))>(*SKIP)(?!))~',
$string,
$matches,
PREG_PATTERN_ORDER
);
print_r( $matches[2] );
Output
Array
(
[0] => /45214
[1] => /7487423
)
Regex explained
(?si) # Modifier, dot-all and ignore case
(?:
  # What we want to examine, any tag with href attribute
  < [\w:]+
  (?= # Assertion (a pseudo atomic group)
    (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
    (?<= \s )
    href \s* = \s* # href attribute
    (?:
      ( ['"] ) # (1), # quote begin
      \s*
      ( # (2 start)
        / \d{4,} # /dddd (slash, 4 or more digits) to be saved
      ) # (2 end)
      \s*
      \1 # quote end
    )
  )
  \s+
  (?: " .*? " | ' .*? ' | [^>]*? )+
  >
  \K # Don't store this match, we already have capture group 2 value
|
  # OR,
  # Match, but skip these (this just advances the current position)
  <
  (?:
    (?:
      (?:
        # Invisible content; end tag req'd
        ( # (3 start)
          script
          | style
          | object
          | embed
          | applet
          | noframes
          | noscript
          | noembed
        ) # (3 end)
        (?:
          \s+
          (?>
            " .*? "
            | ' .*? '
            | (?:
              (?! /> )
              [^>]
            )?
          )+
        )?
        \s* >
      )
      .*? </ \3 \s*
      (?= > )
    )
    | (?: /? [\w:]+ \s* /? )
    | (?:
      [\w:]+
      \s+
      (?:
        " .*? "
        | ' .*? '
        | [^>]?
      )+
      \s* /?
    )
    | \? .*? \?
    | (?:
      !
      (?:
        (?: DOCTYPE .*? )
        | (?: \[CDATA\[ .*? \]\] )
        | (?: -- .*? -- )
        | (?: ATTLIST .*? )
        | (?: ENTITY .*? )
        | (?: ELEMENT .*? )
      )
    )
  )
  >
  (*SKIP)
  (?!)
)

Text to html ratio of a page issue

I am trying to get the text-to-HTML ratio of a given web page. I use a strip_html_tags function to strip out the HTML tags and compare the result to the original page content to get the ratio. My concern is that strip_html_tags may not catch every tag on the page. Is there a better way to do this, perhaps one that simply removes everything between < and >? I can already see that the regex list is missing many tags that should be stripped, but there has to be a better way to do all this.
function strip_html_tags($text)
{
$text = preg_replace(array(
'#<head[^>]*?>.*?</head>#siu',
'#<style[^>]*?>.*?</style>#siu',
'#<script[^>]*?.*?</script>#siu',
'#<object[^>]*?.*?</object>#siu',
'#<embed[^>]*?.*?</embed>#siu',
'#<applet[^>]*?.*?</applet>#siu',
'#<noframes[^>]*?.*?</noframes>#siu',
'#<noscript[^>]*?.*?</noscript>#siu',
'#<noembed[^>]*?.*?</noembed>#siu',
'#</?((address)|(blockquote)|(center)|(del))#iu',
'#</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))#iu',
'#</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))#iu',
'#</?((table)|(th)|(td)|(caption))#iu',
'#</?((form)|(button)|(fieldset)|(legend)|(input))#iu',
'#</?((label)|(select)|(optgroup)|(option)|(textarea))#iu',
'#</?((frameset)|(frame)|(iframe))#iu',
'#<[\/\!]*?[^<>]*?>#siu', // Strip out HTML tags
'#<![\s\S]*?--[ \t\n\r]*>#siu' // Strip multi-line comments including CDATA
), array(
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0"
), $text);
return strip_tags($text);
}
function check_ratio($url)
{
$file_content = // getting data from curl request here
$page_size = mb_strlen($file_content, '8bit');
$content = strip_html_tags($file_content);
$text_size = mb_strlen($content, '8bit');
$content = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", " ", $content);
$len_real = strlen($file_content);
$len_strip = strlen($content);
return round((($len_strip / $len_real) * 100), 2);
}
Why are you reinventing the wheel?
Here's a better way:
http://php.net/manual/en/function.strip-tags.php
DOMNode::$textContent can be a starting point:
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents('http://www.google.com'));
libxml_use_internal_errors(false);
$items = $domd->getElementsByTagName('body');
var_dump($items[0]->textContent);
It also includes data from tags you probably won't consider "text", such as <style> or <script> but it shouldn't be difficult to take that into account.
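As a rough sketch of how that could be taken into account, the example below removes <script> and <style> nodes before reading textContent and then computes the ratio in bytes. The function names visible_text and text_to_html_ratio and the sample page are made up for illustration. It also assumes the markup is plain ASCII or that encoding is handled elsewhere.

```php
<?php
// Sketch: visible text of a page via DOM, skipping <script> and <style>.
function visible_text(string $html): string
{
    if ($html === '') {
        return '';
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Copy the node list into an array before removing nodes from the tree.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    $text = $body === null ? '' : $body->textContent;

    // Collapse whitespace runs so indentation does not inflate the count.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Ratio of visible text bytes to total page bytes, as a percentage.
function text_to_html_ratio(string $html): float
{
    if ($html === '') {
        return 0.0;
    }
    return round(strlen(visible_text($html)) / strlen($html) * 100, 2);
}

// Hypothetical sample page.
$page = '<html><head><style>p{color:red}</style></head>'
      . '<body><p>Hello world</p><script>var x = 1;</script></body></html>';

echo visible_text($page), "\n"; // Hello world
echo text_to_html_ratio($page), "\n";
```

Whether whitespace should be collapsed before measuring is a judgment call. It is done here so that source indentation does not count as text.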
This answer uses a regex.
Update 1:
- An atomic group has to be added around the tag body of invisible content; otherwise unbalanced quotes could cause catastrophic backtracking.
- Added the list of invisible content it will remove:
script, style, head, object, embed, applet, noframes, noscript, noembed
If there is no closing tag, just the tag is removed; otherwise its content is removed along with the tags.
Find Raw Regex
<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
Replace with nothing.
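A short usage sketch, assuming the single-quoted, delimited form of this regex is passed to preg_replace. The sample markup and variable names are invented for the example. Note that the pattern is used without the i flag, so it only matches lower-case names in the invisible-content list.

```php
<?php
// Single-quoted, delimited form of the regex from this answer.
$tagRegex = '/<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/';

// Hypothetical sample page fragment.
$html = '<p class="a">Hi <b>there</b></p><script>var x = 1;</script><!-- note -->';

// Tags, the whole <script> element, and the comment are replaced with nothing.
$text = preg_replace($tagRegex, '', $html);
echo $text, "\n"; // Hi there

// Text-to-HTML ratio in bytes, as a percentage.
$ratio = round(strlen($text) / strlen($html) * 100, 2);
echo $ratio, "%\n";
```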
Various stringed / delimited representations
Delimiter only: /<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/
Single Quote & Delimiter: '/<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/'
Double Quote only: "<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"
Expanded
# <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
<
(?:
  (?:
    (?:
      # Invisible content; end tag req'd
      ( # (1 start)
        script
        | style
        | head
        | object
        | embed
        | applet
        | noframes
        | noscript
        | noembed
      ) # (1 end)
      (?:
        \s+
        (?>
          " [\S\s]*? "
          | ' [\S\s]*? '
          | (?:
            (?! /> )
            [^>]
          )?
        )+
      )?
      \s* >
    )
    [\S\s]*? </ \1 \s*
    (?= > )
  )
  | (?: /? [\w:]+ \s* /? )
  | (?:
    [\w:]+
    \s+
    (?:
      " [\S\s]*? "
      | ' [\S\s]*? '
      | [^>]?
    )+
    \s* /?
  )
  | \? [\S\s]*? \?
  | (?:
    !
    (?:
      (?: DOCTYPE [\S\s]*? )
      | (?: \[CDATA\[ [\S\s]*? \]\] )
      | (?: -- [\S\s]*? -- )
      | (?: ATTLIST [\S\s]*? )
      | (?: ENTITY [\S\s]*? )
      | (?: ELEMENT [\S\s]*? )
    )
  )
)
>
Benchmark:
Regex1: <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
Options: < none >
Completed iterations: 3 / 3 ( x 1000 )
Matches found per iteration: 3780
Elapsed Time: 43.52 s, 43523.08 ms, 43523084 µs
Sample Analysis, page size 126,000 bytes:
3,780 tags / page
x 3,000 iterations
--------------------------
11,340,000 total tags
/ 43.52 seconds
--------------------------
260,569 tags / second
/ 3,780 tags / page
--------------------------
~69 pages / second

preg_split shortcode attributes into array

I would like to parse shortcode into array via "preg_split".
This is example shortcode:
[contactform id="8411" label="This is \" first label" label2='This is second \' label']
and this should be result array:
Array
(
[id] => 8411
[label] => This is \" first label
[label2] => This is second \' label
)
I have this regexp:
$atts_arr = preg_split('~\s+(?=(?:[^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*$)~', trim($shortcode, '[]'));
Unfortunately, this works only if there is no escaping of quotes \' or \".
Thx in advance!
Using preg_split is not always handy or appropriate, in particular when you have to deal with escaped quotes. A better approach is to use preg_match_all, for example:
$pattern = <<<'EOD'
~
(\w+) \s*=
(?|
\s* "([^"\\]*(?:\\.[^"\\]*)*)"
|
\s* '([^'\\]*(?:\\.[^'\\]*)*)'
# | uncomment if you want to handle unquoted attributes
# ([^]\s]*)
)
~xs
EOD;
if (preg_match_all($pattern, $yourshortcode, $matches))
$attributes = array_combine($matches[1], $matches[2]);
The pattern uses the branch reset feature (?|...(..)...|...(...)..) that gives the same number(s) to the capture groups for each branch.
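To make this concrete, here is a small self-contained run of the same idea against the shortcode from the question. The variable names and the inline comments are additions for this example, and the optional unquoted-attribute branch is left out.

```php
<?php
// Shortcode from the question; a nowdoc keeps the backslashes exactly as typed.
$shortcode = <<<'SC'
[contactform id="8411" label="This is \" first label" label2='This is second \' label']
SC;

$pattern = <<<'EOD'
~
(\w+) \s*=                            # group 1: attribute name
(?|                                   # branch reset: each branch captures into group 2
    \s* "([^"\\]*(?:\\.[^"\\]*)*)"    # double-quoted value, escaped quotes allowed
  |
    \s* '([^'\\]*(?:\\.[^'\\]*)*)'    # single-quoted value, escaped quotes allowed
)
~xs
EOD;

$attributes = [];
if (preg_match_all($pattern, $shortcode, $matches)) {
    // $matches[1] holds the names and $matches[2] the values, in matching order.
    $attributes = array_combine($matches[1], $matches[2]);
}

print_r($attributes);
```

Because both quoting styles write into group 2, the values line up with the names in $matches[1] without any post-processing, and the backslash escapes are kept in the values as the question requested.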
I was speaking about the \G anchor in my comment. This anchor succeeds if the current position is immediately after the last match. It can be useful if you want to check the syntax of your shortcode from start to end at the same time (otherwise it is totally useless). Example:
$pattern2 = <<<'EOD'
~
(?:
\G(?!\A) # anchor for the position after the last match
# it ensures that all matches are contiguous
|
\[(?<tagName>\w+) # beginning of the shortcode
)
\s+
(?<key>\w+) \s*=
(?|
\s* "(?<value>[^"\\]*(?:\\.[^"\\]*)*)"
|
\s* '([^'\\]*(?:\\.[^'\\]*)*)'
# | uncomment if you want to handle unquoted attributes
# ([^]\s]*)
)
(?<end>\s*+]\z)? # check that the end has been reached
~xs
EOD;
if (preg_match_all($pattern2, $yourshortcode, $matches) && end($matches['end']) !== '')
$attributes = array_combine($matches['key'], $matches['value']);

php regex: Use quotes for match, but don't capture them

I'm unsure if I should be using preg_match, preg_match_all, or preg_split with delim capture. I'm also unsure of the correct regex.
Given the following:
$string = " ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\"";
I want to get an array with the following elems:
[0] = "ok"
[1] = "that\'s"
[2] = "yeah that's \"cool\""
You cannot do this with a regular expression because you're trying to parse a non-context-free grammar. Write a parser.
Outline:
Read character by character; if you see a \, remember it.
If you see a " or ', check whether the previous character was \. You now have your delimiting condition.
Record all the tokens in this manner.
Your desired result set seems to trim spaces, and you also lost a couple of the backslashes. Perhaps this is a mistake, but it can be important.
I would expect:
[0] = " ok " // <-- spaces here
[1] = "that\\'s cool"
[2] = " \"yeah that's \\\"cool\\\"\"" // leading space here, and \" remains
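The outline above can be sketched roughly as follows. This is only one possible reading of it: the tokenize function name is made up, whitespace outside quotes is assumed to separate tokens, surrounding quotes are dropped, and backslash escapes are kept verbatim. So it produces the trimmed form the question asks for, not the space-preserving form listed just above.

```php
<?php
// Hypothetical helper: split on whitespace, treating quoted runs
// (with backslash escapes) as single tokens.
function tokenize(string $input): array
{
    $tokens = [];
    $current = '';
    $quote = null;     // active quote character, or null outside quotes
    $inToken = false;  // true once the current token has started
    $length = strlen($input);

    for ($i = 0; $i < $length; $i++) {
        $char = $input[$i];

        if ($char === '\\' && $i + 1 < $length) {
            // Keep the backslash and the escaped character verbatim.
            $current .= $char . $input[++$i];
            $inToken = true;
        } elseif ($quote !== null) {
            if ($char === $quote) {
                $quote = null;         // closing quote ends the quoted run
            } else {
                $current .= $char;     // anything else inside quotes is content
            }
        } elseif ($char === '"' || $char === "'") {
            $quote = $char;            // opening quote is not stored
            $inToken = true;
        } elseif (ctype_space($char)) {
            if ($inToken) {            // whitespace outside quotes ends a token
                $tokens[] = $current;
                $current = '';
                $inToken = false;
            }
        } else {
            $current .= $char;
            $inToken = true;
        }
    }

    if ($inToken) {
        $tokens[] = $current;          // unterminated quote: end of input closes it
    }

    return $tokens;
}

$string = " ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\"";
print_r(tokenize($string));
```

For the sample string this yields three tokens: ok, that\'s cool, and yeah that's \"cool\".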
Actually, you might be surprised to find that you can do this in regex:
preg_match_all("((?|\"((?:\\\\.|[^\"])+)\"|'((?:\\\\.|[^'])+)'|(\w+)))",$string,$m);
The desired result array will be in $m[1].
You can do it with a regex:
$pattern = <<<'LOD'
~
(?J)
# Definitions #
(?(DEFINE)
(?<ens> (?> \\{2} )+ ) # even number of backslashes
(?<sqc> (?> [^\s'\\]++ | \s++ (?!'|$) | \g<ens> | \\ '?+ )+ ) # single quotes content
(?<dqc> (?> [^\s"\\]++ | \s++ (?!"|$) | \g<ens> | \\ "?+ )+ ) # double quotes content
(?<con> (?> [^\s"'\\]++ | \s++ (?!["']|$) | \g<ens> | \\ ["']?+ )+ ) # content
)
# Pattern #
\s*+ (?<res> \g<con>)
| ' \s*+ (?<res> \g<sqc>) \s*+ '?+
| " \s*+ (?<res> \g<dqc>) \s*+ "?+
~x
LOD;
$subject = " ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\"";
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
var_dump($match['res']);
}
I chose to trim spaces in all results, so " abcd " gives abcd. This pattern allows as many backslashes as you want, anywhere you want. If a quoted string is not closed at the end of the string, the end of the string is treated as the closing quote (this is why I made the closing quotes optional). So abcd " ef'gh gives you abcd and ef'gh.
