Is there a regular expression that can match any of the following?
'<'+'script>'
'<s'+'cript>'
'<script'+'>'
'</'+'script>'
'</scr' + 'ipt>'
'<script></scrip'+'t>'
'<script type=text/javascript src="http://..."></scrip'+'t>'
I need to do this because HTML Tidy produces errors when these strings appear in the HTML. I want to remove them using preg_replace().
Wow, interesting, but I think a parser of some sort would be a more reliable solution.
The following regex is a bit of an abomination, but it'll match what you want:
'</?(?:'\+')?(?=s).+(?=c).(?=r).+(?=i).+(?=p).+(?=t).+>'
It will also match a variety of tags that you don't want; I leave that to you:
'<scdcdacacapt type=text/javascript src="http://..."></cdscdcss'+'t>'
This is because of the javascript string in the type attribute, so if you have the word javascript inside any tag it'll match :(
Hopefully it's a starting point for you.
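A stricter alternative to the lookahead pattern is to allow an optional quote-plus-quote join between every two characters of the literal tag. The following is only a sketch under assumptions: splitTolerant is a hypothetical helper name, only single-quoted joins are handled, and the check looks for the opening "<script" or the closing "</script>" rather than removing anything.

```php
<?php
// Hypothetical helper: allow an optional '+' string concatenation
// (quote, plus sign, quote) between any two characters of $literal.
function splitTolerant(string $literal): string
{
    $joint = "(?:'\\s*\\+\\s*')?";
    $quoted = array_map(
        function ($char) {
            return preg_quote($char, '~');
        },
        str_split($literal)
    );
    return implode($joint, $quoted);
}

// Matches a closing "</script>" or an opening "<script", split or not.
$pattern = '~' . splitTolerant('</script>') . '|' . splitTolerant('<script') . '~i';

// Sample strings taken from the question.
$samples = [
    "'<'+'script>'",
    "'<s'+'cript>'",
    "'<script'+'>'",
    "'</'+'script>'",
    "'</scr' + 'ipt>'",
    "'<script></scrip'+'t>'",
];

$results = [];
foreach ($samples as $sample) {
    // preg_match returns 1 when the pattern is found, 0 otherwise
    $results[] = preg_match($pattern, $sample);
}
```

Because the join is only permitted between the characters of the word itself, an unrelated tag with extra letters in between should not match, unlike the lookahead version.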
Use '\x3cscript\x3e' instead of '<script>'.
I need to match parts of a string while ignoring HTML tags. For example, suppose the user wants to look for the string "foo and foo1" in this source:
Two strings, <u>foo</u> and foo1
They would not get a match because of the tags.
I've tried regex, but since the tags may or may not be there, it gets rather complicated.
It's not a server-side script. It would be an application run from the console.
To be more specific: it is for syntax highlighting. The user wants "foo and foo1" to be italic, but part of it is already underlined and wouldn't match anyway. That's why I can't strip the tags from the string.
Use the PHP function strip_tags to remove the HTML tags from the text. Then do your search.
http://php.net/manual/en/function.strip-tags.php
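A minimal sketch of that approach, using the sample string from the question (the variable names are illustrative). Note that the offset found this way refers to the stripped text, not to the original markup, which matters if the match has to be highlighted in place.

```php
<?php
$source = 'Two strings, <u>foo</u> and foo1';
$needle = 'foo and foo1';

// Remove all tags, then search the remaining plain text.
$plain = strip_tags($source);

// strpos returns an integer offset, or false when the needle is absent,
// so the comparison has to be strict.
$position = strpos($plain, $needle);
$found = $position !== false;
```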
Use strip_tags as you have been advised; it really is the best way. However, if you want to have fun or experiment and benchmark your regex engine :) you can insert (?:<\/?[^>]+>)? after each character of the query, and also at the very beginning of the query (otherwise the opening tag won't be captured), and you will get a match.
Here is an example for "foo and foo1":
(?:<\/?[^>]+>)?f(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)? (?:<\/?[^>]+>)?a(?:<\/?[^>]+>)?n(?:<\/?[^>]+>)?d(?:<\/?[^>]+>)? (?:<\/?[^>]+>)?f(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)?1(?:<\/?[^>]+>)?
This will match <u>foo</u> and foo1.
https://regex101.com/r/aF8fJ8/4
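Writing that pattern by hand does not scale, so it can be generated from the query. This is a sketch only: buildTagTolerantPattern is a hypothetical helper name, and it assumes at most one tag between two characters, exactly as in the hand-written pattern.

```php
<?php
// Hypothetical helper: put an optional-tag group before, between and
// after the characters of the query.
function buildTagTolerantPattern(string $query): string
{
    $tag = '(?:<\/?[^>]+>)?';
    // Split into characters (UTF-8 aware) and escape regex metacharacters.
    $chars = preg_split('//u', $query, -1, PREG_SPLIT_NO_EMPTY);
    $parts = array_map(
        function ($char) {
            return preg_quote($char, '/');
        },
        $chars
    );
    return '/' . $tag . implode($tag, $parts) . $tag . '/u';
}

$pattern = buildTagTolerantPattern('foo and foo1');
$matched = preg_match($pattern, 'Two strings, <u>foo</u> and foo1', $m);
// $m[0] holds the matched span of the original markup, tags included.
```

Passing PREG_OFFSET_CAPTURE to preg_match would additionally give the position of the span in the original markup, which is what a highlighter needs.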
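This regex will skip the <> and slashes in HTML tags and extract only runs of letters.
(?!</?[^>]+>)([a-zA-Z]+)
Just replace the [a-zA-Z]+ with what you want to match. Note that the lookahead only rejects a match that starts at a <, so tag names themselves (the u in <u>) are still matched as words.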
I need to scrape some data from a website. For that I am using preg_match, but I am not able to write the regex for it. The data on the website is
title="Russia"/></a>
<small>*</small> <a href="/profile/roman
I have written the regex as #title=\"Russia\"\/><\/a>((\n|\r)*)<small>*<\/small> <a href=\"/profile/(.+?)\"#sx
But this is not working and I don't know why. When I echo my regex it prints #title="Russia"\/><\/a>(( | )*)*<\/small> . Where has the rest gone? And why is it not working?
Try this:
#title=\"Russia\"/></a>(\s*)<small>\*</small>\s+<a\s+href=\"/profile/(.+?)\"#sx
I have escaped the * because it's a metacharacter. Without the escape, <small>* would match the text <small followed by zero or more > characters.
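To see the difference in practice, here is a small sketch. The markup is modelled on the question, but the img tag and the profile name example-user are invented for illustration, and the pattern is written in a single-quoted string so that no extra quote escaping is needed.

```php
<?php
// Hypothetical markup modelled on the question; the profile name is invented.
$html = "<img title=\"Russia\"/></a>\n<small>*</small> <a href=\"/profile/example-user\">";

// The asterisk is escaped, and whitespace is matched with \s.
$pattern = '#title="Russia"/></a>\s*<small>\*</small>\s+<a\s+href="/profile/(.+?)"#s';
$profile = preg_match($pattern, $html, $m) === 1 ? $m[1] : null;

// Without the backslash, "<small>*" means "<small" followed by zero or
// more ">" characters, so a literal asterisk is not matched.
$unescaped = '#<small>*</small>#';
$matchesLiteralAsterisk = preg_match($unescaped, '<small>*</small>');
$matchesEmptyElement = preg_match($unescaped, '<small></small>');
```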
You really should not use regexes to evaluate markup content, especially when you acquire it by scraping pages.
In your case there are at least three things that might be responsible for breaking your regex.
Do not attempt to write your own whitespace matchers when you can simply use \s, which stands for "any whitespace character".
In regular expressions the asterisk (*) has a special meaning, which is why you can't simply use it to match asterisks. If you want to collect the content inside the small element you should use <small>(.*)</small> instead. If, on the other hand, you are actually expecting an asterisk, then you have to escape it like this: <small>\*</small>.
Your regex expects a closing quote for your href attribute on that last <a> but in your sample markup you have none. Provided that on the original page you do have a closing quote the following regex should do the trick.
#title=\"Russia\"\/><\/a>(\s*)<small>\*<\/small>\s+<a\s+href=\"\/profile\/(.+?)\"#sx
However, once again I have to advise using a DOM parser like DOMDocument for this, not only because it is much more reliable when handling markup content but also because it can interpret bad markup as well (if it's loaded as HTML, of course).
Ok, so here's my issue:
I have a link, say: http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV
And the link is between two tags say like this:
<br>http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV<br></p>
Using this regex with preg_replace:
'#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i'
As such:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "***",$strText);
The resulting string is:
<br***p>
Which is wrong!!
It should have been
<br>***<br></p>
How can I get the desired result? I have been racking my brain trying to figure this one out.
I would like to mention that str_replace replaces even the link within another valid link, so it's not a good method, I need an exact match between two boundaries, even if the boundary is text or another HTML tag.
Assuming you don't want to use a DOM parser for some reason, I believe doing what you intended is as simple as the following:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "$1***$3",$strText);
This uses $1 and $3 to put back the delimiting text you matched in your regular expression.
As others have pointed out, using a DOM parser is more reliable.
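As a side note, addcslashes with '.?+' only escapes three metacharacters. A sketch of the same replacement using preg_quote, which escapes every regex metacharacter plus the delimiter passed as its second argument (the variable names follow the question; the markup is the sample from the question):

```php
<?php
$link = 'http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV';
$strText = '<br>' . $link . '<br></p>';

// Same boundaries as in the question, but the link is escaped with preg_quote.
$pattern = '#(^|[^\/]|[^>])(' . preg_quote($link, '#') . ')([^\w\/]|[^<]$)#i';

// $1 and $3 put the boundary characters back around the replacement.
$result = preg_replace($pattern, '$1***$3', $strText);
```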
Does this do what you want?
I've got a problem with the regexp function preg_replace() in PHP.
I want to get the view state from an HTML input element, but it doesn't work properly.
This code:
$viewstate = preg_replace('/^(.*)(<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value=")(.*[^"])("\s+name="__VIEWSTATE">)(.*)$/u','^\${3}$',$html);
Returns this:
%0D%0A%0D%0A%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.0+Transitional%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-transitional.dtd%22%3E%0D%0A%0D%0A%3Chtml+xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml%22+%3E%0D%0A%3Chead%3E%3Ctitle%3E%0D%0A%09Strava.cz%0D%0A%3C%2Ftitle%3E%3Clink+rel%3D%22shortcut+icon%22+href%3D%22..%2FGrafika%2Ffavicon.ico%22+type%3D%22image%2Fx-icon%22+%2F%3E%3Clink+rel%3D%22stylesheet%22+type%3D%22text%2Fcss%22+media%3D%22screen%22+href%3D%22..%2FStyly%2FZaklad.css%22+%2F%3E%0D%0A++++%3Cstyle+type%3D%22text%2Fcss%22%3E%0D%0A++++++++.style1%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+47px%3B%0D%0A++++++++%7D%0D%0A++++++++.style2%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+64px%3B%0D%0A++++++++%7D%0D%0A++++%3C%2Fstyle%3E%0D%0A%0D%0A%3Cscript+type%3D%22text%2Fjavascript%22%3E%0D%0A%0D%0A++var+_gaq+%3D+_gaq+%7C%7C+%5B%5D%3B%0D%0A++_gaq.push%28%5B
EDIT: Sorry, I left this question for a long time. Finally I used DOMDocument.
To be safe, I'd split this match into two phases:
Find the relevant input element
Get the value
This is because you cannot be certain what order the attributes will appear in within the element.
if (preg_match('/<input[^>]+name="__VIEWSTATE"[^>]*>/i', $input, $match)) {
    $value = preg_replace('/.*value="([^"]*)".*/i', '$1', $match[0]);
}
And, of course, always consider DOM and DOMXpath over regex for parsing html/xml.
You should only capture when you're planning on using the data, so most of the () groups in that pattern are unnecessary. Not a cause for failure, but I thought I'd mention it.
Instead of using [^"] to say that you don't want that character, you could use the non-greedy modifier, ?. This makes sure the pattern matches as little as it can. Since you have name="__VIEWSTATE" following the value, this should be safe.
Let's put this into practice and simplify the pattern a bit. This works as you want:
'/.*<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="(.+?)"\s+name="__VIEWSTATE">.*/'
I would strongly recommend checking out an alternative to regexp for DOM operations. This makes certain your code also works if the attributes change order. Plus it's so much nicer to work with.
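For reference, a minimal sketch of the DOM route using PHP's DOMDocument and DOMXPath. The markup and the value EXAMPLE_VALUE are invented for illustration, and the sketch assumes the DOM extension is available.

```php
<?php
// Hypothetical page fragment; the value is a placeholder.
$html = '<html><body><form>'
    . '<input id="__VIEWSTATE" type="hidden" value="EXAMPLE_VALUE" name="__VIEWSTATE">'
    . '</form></body></html>';

$doc = new DOMDocument();
// Collect parser warnings about sloppy markup instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Select the input by name; attribute order no longer matters.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//input[@name="__VIEWSTATE"]');
$viewstate = $nodes->length > 0 ? $nodes->item(0)->getAttribute('value') : null;
```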
The main mistake was the use of the function preg_replace, which returns the whole subject string with the replacements applied (unchanged if nothing matches), not just the matched part. Thank you for your ideas and for the recommendation of DOMDocument. m93a
http://www.php.net/manual/en/function.preg-replace.php#refsect1-function.preg-replace-returnvalues
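The difference can be sketched as follows. The markup and EXAMPLE_VALUE are placeholders, and the first pattern is a shortened stand-in for the one in the question: it is anchored with ^ and $ and has no s modifier, so it cannot match across the line breaks.

```php
<?php
// Hypothetical multi-line page; the value is a placeholder.
$html = "<html>\n<body>\n"
    . "<input id=\"__VIEWSTATE\" type=\"hidden\" value=\"EXAMPLE_VALUE\" name=\"__VIEWSTATE\">\n"
    . "</body>\n</html>";

// No match is possible here, so preg_replace hands back the subject untouched.
$replaced = preg_replace('/^(.*)value="([^"]*)"(.*)$/', '$2', $html);
$unchanged = ($replaced === $html);

// preg_match fills $m with the captured group instead.
$pattern = '/<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="([^"]*)"\s+name="__VIEWSTATE">/';
$viewstate = preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
```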
I need to parse and return the tagname and the attributes in our PHP code files:
<ct:tagname attr="attr1" attr="attr2">
For this purpose the following regular expression has been constructed:
(\<ct:([^\s\>]*)([^\>]*)\>)
This expression works as expected but it breaks when the following code is parsed
<ct:form/input type="attr1" value="$item->field">
The original regular expression breaks because of the > character in the $item->field. I would need to construct a regular expression that ignores the -> or => but not the single >.
I am open to any suggestions... Thanks for your help in advance.
Try this:
<ct:([^\s\>]*)((?:\s+\w+\s*=\s*(?:"[^"]*"|'[^']*')\s*)*)>
But if that's XML, you had better use an XML parser.
You could try using a negative lookbehind like this:
(\<ct:([^\s\>]*)(.*?)(?<!-|=)\>)
Matches :
<ct:tagname attr="attr1" attr="attr2">
<ct:form/input type="attr1" value="$item->field">
I'm not sure that it is the best-suited solution for your case, but it respects the constraints.
In general, any parsing problem rapidly runs into language constructs that are context-free but not regular. It may be a better[1] solution to write a context-free parser, ignoring everything except the elements you're interested in.
[1] "better" as seen from a viewpoint of Being The Right Thing, not necessarily a return on investment one.
I think what you want to do is not recognize the -> and =>, but ignore everything between pairs of quotes.
I think it can be done by inserting ("[^"]*")* at the appropriate place.
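One way to sketch that idea in PHP: inside the tag, accept either a complete quoted string or any single character that is not a quote or >. The pattern and variable names below are illustrative, not taken from the question's code base, and the attribute loop only handles double-quoted values.

```php
<?php
// Group 1: tag name. Group 2: everything up to the closing ">",
// where a ">" inside quotes is skipped as part of the quoted string.
$pattern = '/<ct:([^\s>]+)((?:[^>"\']|"[^"]*"|\'[^\']*\')*)>/';

// Single quotes keep PHP from interpolating $item here.
$source = '<ct:form/input type="attr1" value="$item->field">';

$matched = preg_match($pattern, $source, $m);
$tagName = $m[1];

// Split the attribute text into name/value pairs.
preg_match_all('/([\w:-]+)\s*=\s*"([^"]*)"/', $m[2], $attrs, PREG_SET_ORDER);
```

Each entry of $attrs then holds the attribute name at index 1 and its value at index 2.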
My suggestion is to match the attributes in the same expression.
\<ct:([^\s\>]*)((\s+([a-z0-9]+)=\"([^\"]*)\")*)\s*\>
edit: removed part about > not being valid xml in attribute values.