Text to html ratio of a page issue

I am trying to compute the text-to-HTML ratio of a given webpage. I use a strip_html_tags function to strip out the HTML tags, then compare the length of the result with the length of the original page content. My concern is that strip_html_tags may not catch every tag on the page; I can already see that the regexes miss many tags that should be stripped. Is there a better way to do this, perhaps something that simply removes everything between < and >?
function strip_html_tags($text)
{
$text = preg_replace(array(
'#<head[^>]*?>.*?</head>#siu',
'#<style[^>]*?>.*?</style>#siu',
'#<script[^>]*?.*?</script>#siu',
'#<object[^>]*?.*?</object>#siu',
'#<embed[^>]*?.*?</embed>#siu',
'#<applet[^>]*?.*?</applet>#siu',
'#<noframes[^>]*?.*?</noframes>#siu',
'#<noscript[^>]*?.*?</noscript>#siu',
'#<noembed[^>]*?.*?</noembed>#siu',
'#</?((address)|(blockquote)|(center)|(del))#iu',
'#</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))#iu',
'#</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))#iu',
'#</?((table)|(th)|(td)|(caption))#iu',
'#</?((form)|(button)|(fieldset)|(legend)|(input))#iu',
'#</?((label)|(select)|(optgroup)|(option)|(textarea))#iu',
'#</?((frameset)|(frame)|(iframe))#iu',
'#<[\/\!]*?[^<>]*?>#siu', // Strip out HTML tags
'#<![\s\S]*?--[ \t\n\r]*>#siu' // Strip multi-line comments including CDATA
), array(
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
' ',
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0",
"\n\$0"
), $text);
return strip_tags($text);
}
function check_ratio($url)
{
$file_content = // getting data from curl request here
$page_size = mb_strlen($file_content, '8bit');
$content = strip_html_tags($file_content);
$text_size = mb_strlen($content, '8bit');
$content = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", " ", $content);
$len_real = strlen($file_content);
$len_strip = strlen($content);
return round((($len_strip / $len_real) * 100), 2);
}

Why are you reinventing the wheel?
Here's a better way, the built-in strip_tags():
http://php.net/manual/en/function.strip-tags.php
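A minimal sketch of what that suggestion could look like for the ratio calculation. The function name strip_tags_ratio and the sample markup are made up for illustration and are not from the original post. One thing to keep in mind is that strip_tags() removes the tags but keeps the text between them, so the contents of <script> and <style> elements still count as text:

```php
<?php
// Hypothetical helper, not from the original post: ratio of tag-free text
// to the full page source, as a percentage.
function strip_tags_ratio(string $html): float
{
    if ($html === '') {
        return 0.0; // avoid division by zero on an empty page
    }
    $text = strip_tags($html);
    return round(strlen($text) / strlen($html) * 100, 2);
}

echo strip_tags_ratio('<p>Hello</p>'), "\n";        // 5 of 12 bytes: 41.67
echo strip_tags('<script>var a;</script>Hi'), "\n"; // var a;Hi (script body survives)
```

If script and style bodies should not count as text, they have to be removed before calling strip_tags().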

DOMNode::$textContent can be a starting point:
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents('http://www.google.com'));
libxml_use_internal_errors(false);
$items = $domd->getElementsByTagName('body');
var_dump($items[0]->textContent);
Note that textContent also includes the contents of elements you probably don't consider "text", such as <style> or <script>, but it shouldn't be difficult to take that into account.
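One way to take that into account, sketched under a few assumptions: the helper name dom_text_ratio, the choice to drop script, style and noscript elements, and the whitespace collapsing are all illustrative decisions rather than part of the answer above.

```php
<?php
// Hypothetical helper (name and choices are assumptions, not from the post):
// percentage of the page source that is visible body text.
function dom_text_ratio(string $html): float
{
    if ($html === '') {
        return 0.0; // nothing to parse, and avoids division by zero
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop elements whose content is not visible text.
    $xpath = new DOMXPath($dom);
    foreach (iterator_to_array($xpath->query('//script | //style | //noscript')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    $text = $body === null ? '' : $body->textContent;
    $text = trim(preg_replace('/\s+/', ' ', $text)); // collapse whitespace runs

    return round(strlen($text) / strlen($html) * 100, 2);
}

$html = '<html><head><title>t</title><style>p{color:red}</style></head>'
      . '<body><p>Hello</p><script>var a = 1;</script></body></html>';
echo dom_text_ratio($html), "\n";
```

Here the only text left in the body is "Hello", so the ratio is 5 bytes divided by the length of the full source. Note that textContent is always UTF-8, so for pages in another encoding the byte counts are only approximately comparable.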

This approach uses a regex.
Update 1:
- Added an atomic group around the tag body of invisible content; without it, unbalanced quotes could cause catastrophic backtracking.
- Added the list of invisible-content elements it removes:
script, style, head, object, embed, applet, noframes, noscript, noembed
If such an element has no closing tag, only the tag itself is removed; otherwise its content is removed along with the tags.
Find Raw Regex
<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
Replace with nothing.
Various stringed / delimited representations
Delimiter only: /<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/
Single Quote & Delimiter: '/<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/'
Double Quote only: "<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"
Expanded
# <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
<
(?:
(?:
(?:
# Invisible content; end tag req'd
( # (1 start)
script
| style
| head
| object
| embed
| applet
| noframes
| noscript
| noembed
) # (1 end)
(?:
\s+
(?>
" [\S\s]*? "
| ' [\S\s]*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
\s* >
)
[\S\s]*? </ \1 \s*
(?= > )
)
| (?: /? [\w:]+ \s* /? )
| (?:
[\w:]+
\s+
(?:
" [\S\s]*? "
| ' [\S\s]*? '
| [^>]?
)+
\s* /?
)
| \? [\S\s]*? \?
| (?:
!
(?:
(?: DOCTYPE [\S\s]*? )
| (?: \[CDATA\[ [\S\s]*? \]\] )
| (?: -- [\S\s]*? -- )
| (?: ATTLIST [\S\s]*? )
| (?: ENTITY [\S\s]*? )
| (?: ELEMENT [\S\s]*? )
)
)
)
>
Benchmark:
Regex1: <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
Options: < none >
Completed iterations: 3 / 3 ( x 1000 )
Matches found per iteration: 3780
Elapsed Time: 43.52 s, 43523.08 ms, 43523084 µs
Sample Analysis, page size 126,000 bytes:
3,780 tags / page
x 3,000 iterations
--------------------------
11,340,000 total tags
/ 43.52 seconds
--------------------------
260,569 tags / second
/ 3,780 tags / page
--------------------------
70 pages / second
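To tie the pattern back to the ratio question, here is a rough sketch that feeds the "Single Quote & Delimiter" form shown above to preg_replace(). The constant name TAG_PATTERN, the helper regex_text_ratio and the sample markup are assumptions made for illustration; the pattern itself is copied unchanged.

```php
<?php
// The single-quoted, delimited pattern from the answer above, unchanged.
const TAG_PATTERN = '/<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/';

// Hypothetical helper: percentage of the page left once the pattern
// has removed tags and invisible content.
function regex_text_ratio(string $html): float
{
    if ($html === '') {
        return 0.0; // avoid division by zero on an empty page
    }
    $text = preg_replace(TAG_PATTERN, '', $html);
    if ($text === null) {
        // preg_replace() returns null on failure, e.g. when a PCRE limit is hit
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }
    return round(strlen($text) / strlen($html) * 100, 2);
}

$html = '<p class="a">Hello <b>world</b></p>'
      . '<script type="text/javascript">var x = "<b>";</script>';
echo preg_replace(TAG_PATTERN, '', $html), "\n"; // Hello world
echo regex_text_ratio($html), "\n";
```

The pattern is used without the i modifier, as in the benchmark, so an upper-case <SCRIPT> element would lose its tags but keep its body; adding i is a possible adjustment.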

Related

Regex split string on a char with exception for inner-string

I have a string like aa | bb | "cc | dd" | 'ee | ff' and I'm looking for a way to split it on the | character, except where | appears inside a quoted string.
The idea is to get something like this: [aa, bb, "cc | dd", 'ee | ff']
I've already found an answer to a similar question here: https://stackoverflow.com/a/11457952/11260467
However, I can't find a way to adapt it to a case with multiple separator characters. Can someone more comfortable with regular expressions help?

This is easily done with the (*SKIP)(*FAIL) functionality pcre offers:
(['"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*
In PHP this could be:
<?php
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
$pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*~';
$splitted = preg_split($pattern, $string);
print_r($splitted);
?>
And would yield
Array
(
[0] => aa
[1] => bb
[2] => "cc | dd"
[3] => 'ee | ff'
)
See a demo on regex101.com and on ideone.com.

This is easier if you match the parts rather than split. Quantifiers are greedy by default, so they consume as many characters as possible. This lets you put the more specific patterns for quoted strings before the pattern for an unquoted token:
$subject = '[ aa | bb | "cc | dd" | \'ee | ff\' ]';
$pattern = <<<'PATTERN'
(
(?:[|[]|^) # after | or [ or string start
\s*
(?<token> # name the match
"[^"]*" # string in double quotes
|
'[^']*' # string in single quotes
|
[^\s|]+ # non-whitespace
)
\s*
)x
PATTERN;
preg_match_all($pattern, $subject, $matches);
var_dump($matches['token']);
Output:
array(4) {
[0]=>
string(2) "aa"
[1]=>
string(2) "bb"
[2]=>
string(9) ""cc | dd""
[3]=>
string(9) "'ee | ff'"
}
Hints:
The <<<'PATTERN' syntax is called NOWDOC (a heredoc without variable interpolation) and cuts down on escaping
I use () as pattern delimiters - they are group 0
Naming matches makes the code a lot more readable
Modifier x allows you to indent and comment the pattern

Use
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
preg_match_all("~(?|\"([^\"]*)\"|'([^']*)'|([^|'\"]+))(?:\s*\|\s*|\z)~", $string, $matches);
print_r(array_map(function($x) {return trim($x);}, $matches[1]));
See PHP proof.
Results:
Array
(
[0] => aa
[1] => bb
[2] => cc | dd
[3] => ee | ff
)
EXPLANATION
--------------------------------------------------------------------------------
(?| Branch reset group, does not capture:
--------------------------------------------------------------------------------
\" '"'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^\"]* any character except: '\"' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\" '"'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^|'\"]+ any character except: '|', ''', '\"'
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\| '|'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\z the end of the string
--------------------------------------------------------------------------------
) end of grouping

It's interesting that there are so many ways to construct a regular expression for this problem. Here is another that is similar to @Jan's answer.
(['"]).*?\1\K| *\| *
PCRE Demo
(['"]) # match a single or double quote and save to capture group 1
.*? # match zero or more characters lazily
\1 # match the content of capture group 1
\K # reset the starting point of the reported match and discard
# any previously-consumed characters from the reported match
| # or
\ * # match zero or more spaces
\| # match a pipe character
\ * # match zero or more spaces
Notice that the part before the alternation character ("or") serves merely to move the engine's internal string pointer to just past the closing quote of a quoted substring.

[php, maybe regex]how to remove all strings, except [/] + [sequence of 4 or more numbers] (/1111)

I have a large string stored in a variable (the source code of big pages). I want everything to be removed except the
values inside href="HERE",
like this: href="/45214"
It is important that only values in this format are preserved: a single / followed by a sequence of 4 or more digits.
Expected output:
/45214
I think it's something like this:
'/href=\"(\/)[0-9]/'
$source = '</li>
<li >
<div class="widget-post-holder">
<a href="/45214" title="care with your skin against
pollution" class="post-thumb" >
<span class="post-cont">
health </span>
<div class="librLoaderLine"></div>
<img title="care with your skin against pollution"
id="0045214"
class="te lazy js-postPreview"
data-src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
data-libr="https://healthandc.com/media/posts/201105/23/45214/libr_225k_45214.webm"
alt="care with your skin against pollution" />
<span class="hd-post" onclick="window.location.href = '/45214'"></span>
</a>
</li>
<li >
<div class="widget-post-holder">
<a href="/7487423" title="natural hair straightening" class="post-thumb" >
<span class="post-cont">health</span>
<div class="librLoaderLine"></div>
<img title="natural hair straightening"
id="0045214"
class="te lazy js-postPreview"
data-src="https://wemedic.com/media/posts/201105/23/7487423/original/14.jpg"
src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
data-libr="https://healthandc.com/media/posts/201105/23/7487423/libr_225k_7487423.webm"
alt="care with your skin against pollution" />
<span class="hd-post" onclick="window.location.href = '/7487423'"></span>
</a>';
preg_match_all("/href=\"(\/)[0-9]/", $source, $results);
var_export(end($results));
expected output:
/45214
/7487423
Thanks

You can use DOMDocument to extract all href attribute values, and then check each with a simple '~^/\d{4,}$~' regex that matches
^ - start of string
/ - a slash
\d{4,} - 4+ digits
$ - end of string.
PHP code:
$html = "YOUR_HTML_CODE";
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$results = [];
foreach ($xpath->query('//*/@href') as $val) {
if (preg_match('~^/\d{4,}$~', $val->value)) {
array_push($results, $val->value);
}
}
print_r($results);
Output:
Array
(
[0] => /45214
[1] => /7487423
)
See the PHP demo.

Although the OP asks for a PHP solution, since this involves HTML you could also use JavaScript with a regex, as follows:
var d = document;
d.g = d.getElementsByTagName;
var aTags = d.g("a");
var matches = [];
var re = /\/\d{4,}/;
for (var i=0, max = aTags.length; i <= max - 1; i++) {
matches[i] = re.exec(aTags[i].href);
}
d.body.innerHTML="";
console.log(matches);
</li>
<li >
<div class="widget-post-holder">
<a href="/45214" title="care with your skin against
pollution" class="post-thumb" >
<span class="post-cont">
health </span>
<div class="librLoaderLine"></div>
<img title="care with your skin against pollution"
id="0045214"
class="te lazy js-postPreview"
data-src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
data-libr="https://healthandc.com/media/posts/201105/23/45214/libr_225k_45214.webm"
alt="care with your skin against pollution" />
<span class="hd-post" onclick="window.location.href ='/45214'"></span>
</a>
</li>
<li >
<div class="widget-post-holder">
<a href="/7487423" title="natural hair straightening" class="post-thumb" >
<span class="post-cont">
health </span>
<div class="librLoaderLine"></div>
<img title="natural hair straightening"
id="0045214"
class="te lazy js-postPreview"
data-src="https://wemedic.com/media/posts/201105/23/7487423/original/14.jpg"
src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
data-libr="https://healthandc.com/media/posts/201105/23/7487423/libr_225k_7487423.webm"
alt="care with your skin against pollution" />
<span class="hd-post" onclick="window.location.href ='/7487423'"></span>
</a>

Use the regex href=\"(\/)[0-9]{4,}; the {4,} quantifier makes sure it matches 4 or more consecutive digits.
See example https://regex101.com/r/BlKv9L/1/
$re = '/href=\"(\/)[0-9]{4,}/m';
$str = ' <a href="/45214" title="care with your skin against
<a href="/452143232" title="care with your skin against
<a href="/214" title="care with your skin against
<a href="/543543545214" title="care with your skin against
<a href="/45215434" title="care with your skin against
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
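In the pattern above the capture group only holds the slash. A small variation, shown here as a sketch with made-up sample markup, moves the digits inside the group and requires the closing quote, so that $matches[1] contains exactly the values the question asks for:

```php
<?php
// Sample markup invented for illustration; the onclick value uses single
// quotes and spaces around "=", so it is not picked up by this pattern.
$html = '<a href="/45214" title="a">'
      . '<a href="/214">'
      . '<a href="/7487423" class="x">'
      . '<span onclick="window.location.href = \'/45214\'"></span>';

// Group 1: a slash followed by 4 or more digits, then the closing quote.
preg_match_all('~href="(/\d{4,})"~', $html, $matches);

print_r($matches[1]); // /45214 and /7487423
```

Requiring the closing quote also rejects values such as /45214/extra, which the unanchored version would partially match.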

can be checked here
<(([^<>"]+"){2})*[^<>"]*href="\K[^"]+

scraper series:
You can use preg_match_all() efficiently with a regex that is safe for tag parsing.
The nice feature of this is that it won't error out on malformed HTML,
and it won't look for matches inside invisible content (such as comments, etc.).
PHP running code
http://sandbox.onlinephpfunctions.com/code/a182a6d57e887d44f9040166cf57fbb3486bb183
<?php
$string = ' HTML ';
preg_match_all
(
'~(?si)(?:<[\w:]+(?=(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)href\s*=\s*(?:([\'"])\s*(/\d{4,})\s*\1))\s+(?:".*?"|\'.*?\'|[^>]*?)+>\K|<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>".*?"|\'.*?\'|(?:(?!/>)[^>])?)+)?\s*>).*?</\3\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:".*?"|\'.*?\'|[^>]?)+\s*/?)|\?.*?\?|(?:!(?:(?:DOCTYPE.*?)|(?:\[CDATA\[.*?\]\])|(?:--.*?--)|(?:ATTLIST.*?)|(?:ENTITY.*?)|(?:ELEMENT.*?))))>(*SKIP)(?!))~',
$string,
$matches,
PREG_PATTERN_ORDER
);
print_r( $matches[2] );
Output
Array
(
[0] => /45214
[1] => /7487423
)
Regex explained
(?si) # Modifier, dot-all and ignore case
(?:
# What we want to examine, any tag with href attribute
< [\w:]+
(?= # Assertion (a pseudo atomic group)
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
(?<= \s )
href \s* = \s* # href attribute
(?:
( ['"] ) # (1), # quote begin
\s*
( # (2 start)
/ \d{4,} # /dddd (slash, 4 or more digits) to be saved
) # (2 end)
\s*
\1 # quote end
)
)
\s+
(?: " .*? " | ' .*? ' | [^>]*? )+
>
\K # Don't store this match, we already have capture group 2 value
|
# OR,
# Match, but skip these (this just advances the current position)
<
(?:
(?:
(?:
# Invisible content; end tag req'd
( # (3 start)
script
| style
| object
| embed
| applet
| noframes
| noscript
| noembed
) # (3 end)
(?:
\s+
(?>
" .*? "
| ' .*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
\s* >
)
.*? </ \3 \s*
(?= > )
)
| (?: /? [\w:]+ \s* /? )
| (?:
[\w:]+
\s+
(?:
" .*? "
| ' .*? '
| [^>]?
)+
\s* /?
)
| \? .*? \?
| (?:
!
(?:
(?: DOCTYPE .*? )
| (?: \[CDATA\[ .*? \]\] )
| (?: -- .*? -- )
| (?: ATTLIST .*? )
| (?: ENTITY .*? )
| (?: ELEMENT .*? )
)
)
)
>
(*SKIP)
(?!)
)

Regex match all php tags pairs

How can I match all PHP tag pairs and the content between them?
I am using this regexp: /<\?.*\?>(?=[^'])/mgs and my text is
<? $b = '?>';$c = '';$d = '';$e = '';$a = 1;?><div><? $a = 1; ?>
My regex gives me only the first match: <? $b = '?>';$c = '';$d = '';$e = '';$a = 1;?>
Actual php code:
$matches = [];
$str = '<? $b = \'?>\';
$c = \'\';$d = \'\';$e = \'\';
$a = 1;?><div><? ?>';
preg_match_all('/<\?.*\?>(?=[^\'])/sm', $str, $matches);
print_r($matches);

My first idea was to use the tokenizer, but after some tests it seems that it isn't able to recognize the short PHP open tag <?.
So I decided to write a pattern that handles the two pitfalls: comments and strings.
$str = <<<'EOD'
<? $b = '?>';
$c = '';$d = '';$e = '';
$a = 1;?><div><? ?>';
EOD;
$pattern = <<<'EOD'
~
# subpatterns definitions
(?(DEFINE)
(?<sqs> # single quote string
' [^'\\]*+ (?s: \\. [^'\\]* )*+ '
)
(?<dqs> # double quote string
" [^"\\]*+ (?s: \\. [^"\\]* )*+ "
)
(?<identifier> [a-zA-Z_][a-zA-Z0-9_]* )
(?<hds> # heredoc string
<<< (?| \g<identifier> | " ( \g<identifier> ) " ) \R
(?: .* \R )*?
\g{-1} ;? \R
)
(?<nds> # nowdoc string
<<< '( \g<identifier> )' \R
(?: .* \R )*?
\g{-1} ;? \R
)
(?<str> \g<sqs> | \g<dqs> | \g<hds> | \g<nds> )
(?<slc> // .* ) # singleline comment
(?<mlc> /\* [^*]*+ (?: \* (?!/) [^*]* )*+ (?:\*/)? ) # multiline comment
(?<comment> \g<slc> | \g<mlc> )
)
# main pattern
<\? (?:php)? \s
(?<content>
[^</"'?]*+
(?:
\g<comment> [^</"'?]*
|
\g<str> [^</"'?]*
|
< [^</"'?]*
|
\? (?!>) [^</"'?]*
)*+
)
(?: \?> )?
~x
EOD;
if ( preg_match_all($pattern, $str, $matches) )
print_r($matches['content']);

regex exclude nested html tag

I have a piece of text:
<strong>blalblalba</strong>blasldasdsadasdasd<strong> 3.5m Euros<br>
<span class="style6">SOLD</span></strong>
and I want to remove the <strong>...</strong> element that contains $, euros or Euros.
So far I have:
preg_replace('#<strong>.*?(^<strong>).*?(\$|euros|Euros|EUROS).*?</strong>#is', '', $result);
but it is not working. I also tried a negative lookahead (?!) but it still doesn't work.
Any help? Thanks

Assuming you expect two <strong> tags before your Euros, I think this may be what you want: preg_replace('#^<strong>.*?<strong>.*?(\$|euros|Euros|EUROS).*?</strong>#is', '', $result);

You can try this; you must use the 'Dot-All' modifier, or substitute [\S\s] for the dot:
# <strong>(?:(?!\1)(?:\$|euros|Euros|EUROS)()|(?!<strong>).)+</strong>\1
<strong>
(?:
(?! \1 )
(?: \$ | euros | Euros | EUROS )
( )
|
(?! <strong> )
.
)+
</strong>
\1
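For comparison, here is a sketch of a different and somewhat simpler technique, a tempered dot, applied to the sample text from the question. This is not the pattern shown above: each character is only consumed if it does not start a <strong> or </strong> tag, so a match can never run across a tag boundary. The variable names are illustrative.

```php
<?php
$result = "<strong>blalblalba</strong>blasldasdsadasdasd<strong> 3.5m Euros<br>\n"
        . '<span class="style6">SOLD</span></strong>';

// (?:(?!</?strong>).) matches one character that does not begin a
// <strong> or </strong> tag; the i modifier covers euros/Euros/EUROS.
$pattern = '#<strong>(?:(?!</?strong>).)*?(?:\$|euros)(?:(?!</?strong>).)*</strong>#is';

$cleaned = preg_replace($pattern, '', $result);
echo $cleaned, "\n"; // <strong>blalblalba</strong>blasldasdsadasdasd
```

The first <strong> element is left alone because the tempered dot stops at its </strong> before any currency marker is found.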

php regex: Use quotes for match, but don't capture them

I'm unsure if I should be using preg_match, preg_match_all, or preg_split with delim capture. I'm also unsure of the correct regex.
Given the following:
$string = " ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\"";
I want to get an array with the following elems:
[0] = "ok"
[1] = "that\'s"
[2] = "yeah that's \"cool\""

You cannot do this with a regular expression because you're trying to parse a non-context-free grammar. Write a parser.
Outline:
- Read character by character; if you see a \, remember it.
- If you see a " or ', check whether the previous character was a \. You now have your delimiting condition.
- Record all the tokens in this manner.
Your desired result set seems to trim spaces, and you also lost a couple of the backslashes; perhaps this is a mistake, but it can be important.
I would expect:
[0] = " ok " // <-- spaces here
[1] = "that\\'s cool"
[2] = " \"yeah that's \\\"cool\\\"\"" // leading space here, and \" remains
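The outline above can be sketched as a small character-by-character tokenizer. The function name tokenize_quoted is hypothetical, and two choices are assumptions rather than part of the answer: text outside quotes is trimmed and kept as a single token, and the surrounding quotes are dropped while backslashes are kept.

```php
<?php
// Hypothetical helper: splits on quoted strings, honouring backslash escapes.
function tokenize_quoted(string $input): array
{
    $tokens  = [];
    $current = '';
    $quote   = null;  // active quote character, or null when outside quotes
    $escaped = false; // true when the previous character was a backslash

    for ($i = 0, $length = strlen($input); $i < $length; $i++) {
        $ch = $input[$i];
        if ($escaped) {
            $current .= $ch;            // escaped character is taken literally
            $escaped = false;
        } elseif ($ch === '\\') {
            $current .= $ch;            // remember the backslash
            $escaped = true;
        } elseif ($quote !== null) {
            if ($ch === $quote) {       // unescaped closing quote ends the token
                $tokens[] = $current;
                $current = '';
                $quote = null;
            } else {
                $current .= $ch;
            }
        } elseif ($ch === '"' || $ch === "'") {
            if (trim($current) !== '') {
                $tokens[] = trim($current); // flush text seen before the quote
            }
            $current = '';
            $quote = $ch;
        } else {
            $current .= $ch;
        }
    }
    if (trim($current) !== '') {
        $tokens[] = trim($current);     // trailing or unterminated token
    }
    return $tokens;
}

print_r(tokenize_quoted(" ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\""));
```

For the sample string this yields ok, that\'s cool and yeah that's \"cool\". Dropping the trim() calls would give the untrimmed form of the first token shown in the expected result above.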

Actually, you might be surprised to find that you can do this in regex:
preg_match_all("((?|\"((?:\\\\.|[^\"])+)\"|'((?:\\\\.|[^'])+)'|(\w+)))",$string,$m);
The desired result array will be in $m[1].

You can do it with a regex:
$pattern = <<<'LOD'
~
(?J)
# Definitions #
(?(DEFINE)
(?<ens> (?> \\{2} )+ ) # even number of backslashes
(?<sqc> (?> [^\s'\\]++ | \s++ (?!'|$) | \g<ens> | \\ '?+ )+ ) # single quotes content
(?<dqc> (?> [^\s"\\]++ | \s++ (?!"|$) | \g<ens> | \\ "?+ )+ ) # double quotes content
(?<con> (?> [^\s"'\\]++ | \s++ (?!["']|$) | \g<ens> | \\ ["']?+ )+ ) # content
)
# Pattern #
\s*+ (?<res> \g<con>)
| ' \s*+ (?<res> \g<sqc>) \s*+ '?+
| " \s*+ (?<res> \g<dqc>) \s*+ "?+
~x
LOD;
$subject = " ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\"";
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
var_dump($match['res']);
}
I chose to trim spaces in all results, so " abcd " gives abcd. This pattern allows as many backslashes as you want, anywhere you want. If a quoted string is not closed before the end of the string, the end of the string is treated as the closing quote (this is why I made the closing quotes optional). So abcd " ef'gh gives you abcd and ef'gh.
