I have a web script that builds an HTML page in a PHP string, then delivers it to the user. All of the pages are generated by index.php, each with a unique URL:
domain.host.com/index.php?loadpage=/BLAH
The homepage is static HTML, but every other page is dynamically generated into this PHP string. It may seem like I'm rambling; I'm just trying to give as much info as possible. I have written some JavaScript to modify the link URL:
BLAH Link
This shows the nice, neat link in the status bar, but the JavaScript sends the browser to the URL I want (I have no need to modify the URL bar, as this is in an iframe).
These links are fine on the static page, but the dynamically generated page that lives in the PHP string is a little harder. I need to search through the string for every occurrence of:
href="?loadpage=/ [WILDCARD] " title=
and replace it with:
href="http://domain.com/ [WILDCARD] " onclick="location.href='?loadpage=/ [WILDCARD] '; return false;" title=
This seems very complicated to me. I think it could be done with ereg or preg match/replace, but I have no clue about regex.
In short, I need some way of searching through a PHP string that contains the full page HTML and replacing the first string with the second, on every occurrence of a link with '?loadpage=/'. Each link will have a different [WILDCARD], so I'm presuming the script will need to find every occurrence, save the [WILDCARD] to a variable, then do the replacement and insert the saved value into the new URL.
EDIT.
Just to clarify what the original link looks like:
<a id="random" href="?loadpage=/BLAH" title="BLAH Title"></a>
This is why I am only searching from the href attribute.
You are right, what you need is a regex. (Your need for a wildcard replace is the clue.) This answer is not supposed to be a complete solution, just to give you an idea of how regexes work. I will leave it to you to integrate this with PHP (try preg_replace).
This is the pattern you want to match:
"\?loadpage=\/([^"]*)"
The \ is an escape for characters that have special meaning in regexes.
So ignoring the escapes this is
"?loadpage=/ //the start of the string up to the wildcard part
() // capturing parentheses, indicating a part that
// you want to access in the replace string
[^"]* // any number of occurrences of any character that is NOT doublequote
// ^ is the negation symbol
// * indicates "zero or more occurrences"
followed by...
" doublequote character
Now you need a replacement string. For this you just need to know that your (capture parentheses) allow you to recall that part of the match. In most regex flavours you can capture these to a series of numbered variables, usually represented as $1, $2, $3... or \1, \2, \3... In your case you only have one capture variable to deal with.
So your replacement string could look like
"http://domain.com/$1/" onclick="location.href='?loadpage=/$1'; return false"
In perl you would put the whole thing together like this:
$string =~ s|"\?loadpage=\/([^"]*)"|"http://domain.com/$1/" onclick="location.href='?loadpage=/$1'; return false"|g;
Note that you don't need to escape your quotemarks. This may differ in php.
As you will see it easily gets very cryptic. regular-expressions.info is a useful online reference.
Just so you know what you are looking at (you won't need to do this in PHP):
=~ is the perl regex operator (you won't use this in php, take a look at the preg_match documentation)
then you have the form
s|match_pattern|replace_pattern|g;
where s indicates replacement (as opposed to simple matching)
g indicates global matching (otherwise process will stop on first match)
||| are the separators. Usually written /// but then you would have to escape all of your URL //s, which doubles the illegibility.
But this is now too much Perl-specific detail; read the PHP regex docs!
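As a rough PHP counterpart to the Perl one-liner above, here is a minimal sketch using preg_replace. It assumes the links always look like the example in the question (href="?loadpage=/..." with double quotes); the $html sample string is a stand-in for the real page string, and the replacement follows the target format given in the question rather than the Perl version.

```php
<?php
// Stand-in for the PHP string holding the full page HTML (assumption).
$html = '<a id="random" href="?loadpage=/BLAH" title="BLAH Title"></a>';

// "~" is used as the delimiter so the slashes in the URL need no escaping.
// ([^"]*) captures everything up to the closing double quote as $1.
$pattern = '~href="\?loadpage=/([^"]*)"~';

// $1 is the captured wildcard part; it is reused in both places.
$replacement = 'href="http://domain.com/$1" onclick="location.href=\'?loadpage=/$1\'; return false;"';

// preg_replace replaces every occurrence by default (like Perl's /g).
$result = preg_replace($pattern, $replacement, $html);

echo $result, "\n";
```

The title attribute is left untouched because the pattern stops at the closing quote of href.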
I'm creating a simple comment system connected to the Steam API. Every Steam user logged in on my website can automatically post things. I'm now changing some functions so that they replace things like URLs.
My question is: when a user posts something like
"Hello I'm nice, have a look at http://www.cute.com"
how can the http:// URL automatically be turned into a link, without changing the http:// text shown in the string?
Maybe something like this?
<?php
$str = "helloo im nice, have a look http://www.cute.com";
echo preg_replace("/http:\/\/(.+)\.(.+)\.(.+)/", "<a href='http://$1.$2.$3'>$1.$2.$3</a>", $str);
?>
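This converts an http:// URL of the form a.b.c into an anchor (an a tag). Because (.+) is greedy and matches spaces too, it works best when the URL is the last thing in the string.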
Alternative added
Alternatively, it might be a good idea to add support for https as well. In which case the following might be useful.
<?php
$str = "helloo im nice, have a look http://www.cute.com";
echo preg_replace("/http(s?):\/\/(.+)\.(.+)\.(.+)/", "<a href='http$1://$2.$3.$4'>http$1://$2.$3.$4</a>", $str);
?>
This takes advantage of the ? quantifier, which means "zero or one of the preceding character". In this case the preceding character is the "s", so "http" and "https" both match.
Explanation
This uses RegEx (or Regular Expressions) to create this.
The first parameter of the preg_replace function takes the RegEx (I like to test mine here: http://regexr.com/).
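In PHP, every RegEx must be wrapped in a pair of delimiters. A forward slash is the most common choice (and the one used here), but other non-alphanumeric characters such as # or ~ work too. The bits in between are as follows.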
http: is simply selecting a string that starts with "http:"
\/\/ is called "escaping" and matches two forward slashes. Since the forward slash is the delimiter here (marking the start and end of the pattern), each one needs to be escaped so that PHP doesn't think the RegEx has ended early.
(.+) The brackets are also special characters (though not escaped) and they are known as "capture groups". They let me see what is between the "http://" and the ".com" (or whatever extension is used). The full stop (or period, ".") matches any single character except a newline, and the + after it means one or more of them.
\. More escaping. Since the full stop is a special character, we have to escape this one to match a literal dot. What that means so far is that we are selecting "http://", then anything, and then stopping at a full stop.
(.+) Last but not least is the final capture group. Again, this matches anything from the string, so that we have our final capture group and the RegEx is complete.
Modifiers:
? means "zero or one of the preceding character". This means that /tests?/ would match test and tests, since s is the preceding character: in the first example there are 0 of it and in the second there is 1.
+ means "one or more of the preceding character". In this case we are saying one or more of anything, which means we expect at least one character to be provided.
The second parameter is our replace part.
In short, $1, $2 and $3 reference the three capture groups (the bracketed parts) in the RegEx above; in the https version, $1 is the optional "s" and the host parts are $2, $3 and $4.
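To illustrate the "zero or one" meaning of ?, here is a small sketch that links both http and https URLs. It deliberately swaps the three greedy (.+) groups for a single "anything up to whitespace" class, which is my own simplification and not the pattern from the answer; the second URL in the sample string is made up.

```php
<?php
// "s?" makes the "s" optional, so both http:// and https:// match.
// [^\s<>"]+ stops at whitespace, angle brackets or quotes, so text after
// the URL is not swallowed the way a greedy (.+) would swallow it.
$pattern = '~https?://[^\s<>"]+~i';

$str = 'helloo im nice, have a look http://www.cute.com and https://example.org/x';

// $0 refers to the whole match.
$html = preg_replace($pattern, '<a href="$0">$0</a>', $str);

echo $html, "\n";
```

For user-submitted text you would normally also HTML-escape the string before linking it; that step is left out here to keep the focus on the quantifier.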
Some further reading
The PHP function I used
More information on Regular Expressions
RegEx capture groups
$string = 'helloo im nice, have a look http://www.cute.com';
$string = str_replace('http://', '', $string);
echo $string;
I'm trying to find a regex that will match each specific tag that contains ../.
I had it matching when each element was on its own line. But then there was an instance where my HTML rendered on one line causing the regex to match the whole line:
<body><img src="../../../img.png"><img src="../../img.png"><img src="../../img.png"><img src="..//../img.png"><img src="..../../img.png">
Here was the regex that I was using
<.*[\.]{2}[\/].*>
You need to make sure to match only one tag per match.
Using a negative character class like below will accomplish that.
<[^>]*\.\./[^>]*>
< = start of tag
[^>]* = any number of characters that aren't >, since > would end the tag
\.\./ = "../" with escapes for the . characters
[^>]* = same as above
> = end of tag
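A quick way to check this pattern against the one-line HTML from the question is preg_match_all. This is only a sketch; the ~ delimiter is my choice so that the / in the pattern needs no escaping.

```php
<?php
$html = '<body><img src="../../../img.png"><img src="../../img.png">'
      . '<img src="../../img.png"><img src="..//../img.png"><img src="..../../img.png">';

// [^>]* cannot cross a ">", so each match stays inside a single tag.
preg_match_all('~<[^>]*\.\./[^>]*>~', $html, $matches);

$tags = $matches[0];       // full text of every matching tag
echo count($tags), "\n";   // one entry per <img>, and none for <body>
print_r($tags);
```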
It appears you might be doing this to prevent path parenting. You should know that for a URL attribute in an HTML tag, the following tags are considered "equivalent":
<img src="../foo.jpg">
<img src="%2e%2e%2ffoo.jpg">
<img src="&#46;&#46;&#47;foo.jpg">
That's because the src attribute goes through HTML entity un-escaping, and then URL un-escaping (in that order) before being used. As a result, there are 5,832 different ways to write '../' into an HTML tag's path attribute (18 ways to write each character times 3 characters).
Making a regex to match any of these encodings of ../ is more difficult, but still possible.
(\.|&#46;|(%|&#37;)(2|&#50;)([Ee]|&#69;|&#101;)){2}(/|&#47;|(%|&#37;)(2|&#50;)([Ff]|&#70;|&#102;))
For reference:
&#46; = . HTML escape sequence
&#47; = / HTML escape sequence
%2E or %2e = . URL escape sequence
%2F or %2f = / URL escape sequence
&#37; = % HTML escape sequence
&#50; = 2 HTML escape sequence
&#69; = E HTML escape sequence
&#101; = e HTML escape sequence
&#70; = F HTML escape sequence
&#102; = f HTML escape sequence
You can see why people usually say it's better to use a real HTML parser, instead of regex!
Anyway, assuming you need this and a full HTML parser isn't feasible, here's the version of <[^>]*[="'/]\.\./[^>]*> that also catches HTML and URL escaping:
<[^>]*[="'/](\.|&#46;|(%|&#37;)(2|&#50;)([Ee]|&#69;|&#101;)){2}(/|&#47;|(%|&#37;)(2|&#50;)([Ff]|&#70;|&#102;))[^>]*>
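An alternative to matching every encoding with one regex is to undo the encodings first and then look for a plain ../. The sketch below is that different technique, not the regex above: it assumes you already have the attribute value in hand, and has_parent_ref is a hypothetical helper name. It decodes in the same order the answer describes (HTML entities first, then URL escapes).

```php
<?php
// Hypothetical helper: true if the attribute value contains "../"
// after HTML-entity decoding followed by URL decoding.
function has_parent_ref(string $attr): bool
{
    $decoded = rawurldecode(html_entity_decode($attr, ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    return strpos($decoded, '../') !== false;
}

var_dump(has_parent_ref('../foo.jpg'));              // plain
var_dump(has_parent_ref('%2e%2e%2ffoo.jpg'));        // URL-escaped
var_dump(has_parent_ref('&#46;&#46;&#47;foo.jpg'));  // HTML-escaped
var_dump(has_parent_ref('img/foo.jpg'));             // no parent reference
```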
The regex matches the whole line because .* is greedy. Try it the way @Avinash Raj suggested in the comments.
SEE DEMO
To get the regexp you want I will try to follow a step by step approach:
First, we need some regex that matches the beginning and end of the tag. But we must be careful, as the tag end character > is allowed inside single- and double-quoted strings. We first construct the regexp that matches the tag body including these quoted strings: ([^"'>]|"[^"]*"|'[^']*')* (a sequence of: a character that is neither a quote (single or double) nor the tag end character, or a single-quoted string, or a double-quoted string)
Now, modify it to match a single quoted string or a double quoted string that includes a ../: ([^"'>]|"[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')* (we can simplify it, eliminating the last * operator, as we will match the whole string with only one matching ../ inside, and we can eliminate the first option, as we will have the ../ seq inside quoted strings). We get to: ("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')
To get a string matching a sequence including at least one of the second strings, we concatenate the first regex at the beginning and at the end, and the other in the middle. We get to: ([^"'>]|"[^"]*"|'[^']*')*("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')([^"'>]|"[^"]*"|'[^']*')*
Now, we only need to surround this regexp with the needed sequences first <[iI][mM][gG][ \t\n], and after >, getting to:
<[iI][mM][gG][ \t\n]([^"'>]|"[^"]*"|'[^']*')*("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')([^"'>]|"[^"]*"|'[^']*')*>
This is the regexp we need. If we extract the content of the second group ($2, \2, etc.) we get the parameter value (with the quotes included) that contains the ../ string.
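Here is a small sketch of that regexp in use from PHP. The sample HTML is made up, and the pattern is stored in a nowdoc so that the single and double quotes inside it need no extra escaping; ~ is used as the delimiter.

```php
<?php
// The regexp from above, unchanged, wrapped in "~" delimiters.
$pattern = <<<'RE'
~<[iI][mM][gG][ \t\n]([^"'>]|"[^"]*"|'[^']*')*("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')([^"'>]|"[^"]*"|'[^']*')*>~
RE;

// Made-up input: the first tag has a ">" inside a quoted attribute.
$html = '<img alt="a > b" src="../up.png"><img src="ok.png">';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the whole tags, $matches[2] the quoted values with "../".
print_r($matches[0]);
print_r($matches[2]);
```

Only the first tag is reported, and the > inside the alt value does not end the match early.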
Don't try to simplify this further as > characters are allowed inside single and double quoted strings, and " are allowed in single quoted strings, and ' are in double quoted strings. As someone explained in another answer to this question, you cannot be greedy (using .* inside, as you'll eat as much input as possible before matching) This regexp will need to match multiline tags, as these could be part of your input file. If you have a well formed HTML file, then you'll have no problem with this regexp.
And a final remark: an HTML tag is defined by a grammar that is regular (it is only a regular subset of the full HTML syntax), so it is perfectly parseable with a regex (the same is not true for the complete HTML language). A regex is far more efficient and less resource-consuming than a full HTML parser. The caveats are that you have to write it (and write it well), whereas HTML parsers that save you that work are easily found with some googling; but you have to write it only once. Regexp parsing is a one-pass process whose cost grows (for this example, at least) linearly with the length of the input text. You'll be advised against this by people who simply don't know how to write the right regexp or don't know how to determine whether some grammar is regular.
Note:
This regexp will match commented tags. In case you don't want to match commented <img> tags, you'll have to extend your regexp a little or do a two pass to eliminate comments first, and then parse tags (the regexp that only recognizes uncommented tags is far more complicated than this) Also, look below for more difficulties you can have on your task to eliminate parent directory references.
Note 2:
As I have read in your comments to some answers, the problem you want to solve (eliminating .. references in HTML/XML sources) is not regular. The reason is that you can have . and .. references embedded in the path strings. Normally, one proceeds by first eliminating the /. or ./ components of the path, getting a path without . (current directory) references. Once you have this, you have to eliminate a/.. references, where a is distinct from ... This leads to eliminating occurrences of a/.., a/b/../.., etc. But the language that matches a^i b^i is not regular (as demonstrated by the pumping lemma; see Google) and you'll need a context-free grammar.
Note 3:
If you limit the number of a/b/c/../../.. levels to some maximum bound, you're still able to find a regexp to match this kind of string, but there can always be an example that breaks your regexp and makes it invalid. Remember, you first have to eliminate the single dot . path component (as you can have something like a/b/./././c/./d/.././e/f/.././../...). You will first eliminate the single dot components, leading to: a/b/c/d/../e/f/../../../... Then you proceed by pairs of <non ..>/.., getting a/b/c/[d/..]/e/f/../../../.. to a/b/c/e/[f/..]/../../.. -> a/b/c/[e/..]/../.. -> a/b/[c/..]/.. -> a/[b/..] -> a (to be precise, you ought to check that the first component of each pair actually exists before eliminating it), and if you get to an empty path, you will have to change it to . for it to be usable.
I have code to do this process, but it's embedded in some bigger program. If you are interested, you can access this code. (look at the rel_path() routine here)
You cannot eliminate a .. element at the beginning of a path (or, more precisely, one that has no <non ..> counterpart), as it refers to something outside of the tree, making the reference dependent on the external structure of the tree.
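The elimination procedure described in these notes can be sketched with a stack. This is not the rel_path() routine mentioned above; normalize_rel_path is a hypothetical name, it only handles relative paths, and it does not check that the eliminated components exist on disk.

```php
<?php
// Hypothetical helper: drops "." components and cancels each "<non ..>/.." pair.
function normalize_rel_path(string $path): string
{
    $stack = [];
    foreach (explode('/', $path) as $part) {
        if ($part === '' || $part === '.') {
            continue;                       // skip empty and "." components
        }
        if ($part === '..' && $stack !== [] && end($stack) !== '..') {
            array_pop($stack);              // cancel the preceding real component
        } else {
            $stack[] = $part;               // keep names and leading ".."
        }
    }
    // An empty result means "current directory".
    return $stack === [] ? '.' : implode('/', $stack);
}

echo normalize_rel_path('a/b/c/d/../e/f/../../../..'), "\n";
echo normalize_rel_path('../x/./y/..'), "\n";
```

A leading .. is kept, since it has no counterpart to cancel against.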
I'm trying to make sure that a Rapidshare URL is valid when a user submits it through my form.
This is the regex that I've come up with so far:
http://rapidshare.com/files/[0-9]+/[a-zA-Z0-9\._-]+
A rapidshare link looks like this:
http://rapidshare.com/files/168501977/some_random-file.zip
My pattern matches, but not entirely correctly. For example, if we use this input:
http://rapidshare.com/files/168501977/some_random-file.zip£%^$
It will still match using the PHP function preg_match(), and let it go through, even though there are illegal symbols on the end of the URL. I want the pattern to match the entire input, and not just a random length that matches.
Any help would be appreciated, cheers!
You need to anchor the regex pattern. Use ^ to anchor the beginning and $ to anchor the end. So the pattern becomes:
^http://rapidshare\.com/files/[0-9]+/[a-zA-Z0-9\._-]+$
This prevents a partial match of the string like the example is generating.
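In PHP the anchored pattern might be used as below. This is a sketch: is_rapidshare_url is a hypothetical helper name, ~ is used as the delimiter to avoid escaping the slashes, and the D modifier is added so that $ does not also match just before a trailing newline.

```php
<?php
// Hypothetical helper: true only if the whole string is a valid link.
function is_rapidshare_url(string $url): bool
{
    // ^ and $ force the pattern to cover the entire input.
    $pattern = '~^http://rapidshare\.com/files/[0-9]+/[a-zA-Z0-9._-]+$~D';
    return preg_match($pattern, $url) === 1;
}

var_dump(is_rapidshare_url('http://rapidshare.com/files/168501977/some_random-file.zip'));
var_dump(is_rapidshare_url('http://rapidshare.com/files/168501977/some_random-file.zip£%^$'));
```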
Validate the start and the end of your string using ^ and $. Example:
^ht{2}p:\/{2}rapidshare\.com\/files\/\d+\/[\.a-zA-Z0-9_-]+$
I need to replace certain user-entered URLs with embedded flash objects...and I'm having trouble with a regex that I'm using to match the url...I think mainly because the URLs are SEO-friendly and therefore a bit more difficult to parse
URL structure: http://www.site.com/item/item_title_that_can_include_1('_etc-32CHARACTERALPHANUMERICGUID
I need to both detect a match of an URL in that format and capture the 32CHARACTERALPHANUMERICGUID which is always placed after the - in the url
something like this:
$ret = preg_replace('#http://www\.site\.com/item/([^-])-([a-zA-Z0-9]+)#','<embed>itemid=$2</embed>', $ret);
For some reason, the above does not find a match for an URL in the specified format. I'm new to regexes, so I think I'm missing something fairly obvious.
You should check out parse_url().
Examine the results - it was made for parsing URLs. You'll be able to extract the data you require from the tokens returned.
If you are regex crazy, try this...
/^http:\/\/www\.site\.com\/item\/[^-]*\-([a-zA-Z0-9]{32})$/
Your example is almost there, but...
When you use a negated character class, i.e. [^-], you still need a quantifier. I used *, meaning 0 or more.
You don't seem to use the item title, so we won't bother capturing it.
You should use beginning (^) and end ($) anchors if the string is always exactly like that.
You say the GUID is 32 chars, so we may as well explicitly state that with the {32} quantifier.
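Putting those points together, a sketch of the match in PHP could look like this. The GUID in the sample URL is invented (any 32 alphanumeric characters would do), # is used as the delimiter as in the question, and the <embed> string is copied from the question's replacement.

```php
<?php
// Sample URL in the format from the question, with a made-up 32-character GUID.
$url = "http://www.site.com/item/item_title_that_can_include_1('_etc-0123456789abcdef0123456789abcdef";

// [^-]* skips the title, then the GUID after the "-" is captured as group 1.
$pattern = '#^http://www\.site\.com/item/[^-]*-([a-zA-Z0-9]{32})$#';

$embed = null;
if (preg_match($pattern, $url, $m) === 1) {
    $embed = '<embed>itemid=' . $m[1] . '</embed>';
}

echo $embed, "\n";
```

Note that [^-]* assumes the title itself never contains a hyphen, which is how the question describes the format.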
I have a load of user-submitted content. It is HTML, and may contain URLs. Some of them will be <a>'s already (if the user is good) but sometimes users are lazy and just type www.something.com or at best http://www.something.com.
I can't find a decent regex to capture URLs but ignore ones that are immediately to the right of either a double quote or '>'. Anyone got one?
Jan Goyvaerts, creator of RegexBuddy, has written a response to Jeff Atwood's blog that addresses the issues Jeff had and provides a nice solution.
\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
In order to ignore matches that occur right next to a " or >, you could add (?<![">]) to the start of the regex, so you get
(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
This will match full addresses (http://...) and addresses that start with www. or ftp. - you're out of luck with addresses like ars.userfriendly.org...
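As a sketch of how this pattern could be applied in PHP: linkify is a hypothetical helper, the braces are used as delimiters because the pattern already contains /, # and ~, and the i modifier supplies the case-insensitivity the character classes rely on. Prefixing http:// for www. and ftp. matches is my own assumption.

```php
<?php
// Hypothetical helper: wraps bare URLs in <a> tags, skipping any URL
// that directly follows a double quote or ">".
function linkify(string $html): string
{
    $pattern = '{(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]}i';

    return preg_replace_callback($pattern, function (array $m): string {
        // Bare "www." / "ftp." matches have no scheme; assume http:// for the href.
        $href = preg_match('~^(?:https?|ftp|file)://~i', $m[0]) ? $m[0] : 'http://' . $m[0];
        return '<a href="' . $href . '">' . $m[0] . '</a>';
    }, $html);
}

$out = linkify('Visit http://www.example.com/page, or <a href="http://example.org">this</a>.');
echo $out, "\n";
```

The trailing comma stays outside the link because the last character of a match must come from the second, stricter character class.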
This thread is old as the hills, but I came across it while working on my own problem: That is, convert any urls into links, but leave alone any that are already within anchor tags. After a while, this is what has popped out:
(?!(?!.*?<a)[^<]*<\/a>)(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]
With the following input:
http://www.google.com
http://google.com
www.google.com
<p>http://www.google.com<p>
this is a normal sentence. let's hope it's ok.
www.google.com
This is the output of a preg_replace:
http://www.google.com
http://google.com
www.google.com
<p>http://www.google.com<p>
this is a normal sentence. let's hope it's ok.
www.google.com
Just wanted to contribute back to save somebody some time.
I made a slight modification to the Regex contained in the original answer:
(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&@/%=~_|$?!:,.]*[A-Z0-9+&@/%=~_|$]
which allows for more subdomains, and also runs a more full check on tags. To apply this to PHP's preg replace, you can use:
$convertedText = preg_replace( '#(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&@/%=~_|$?!:,.]*[A-Z0-9+&@/%=~_|$]#i', '<a href="\0" target="_blank">\0</a>', $originalText );
Note, I removed # from the regex, in order to use it as a delimiter for preg_replace. It's pretty rare that # would be used in a URL anyway.
Obviously, you can modify the replacement text, and remove target="_blank", or add rel="nofollow" etc.
Hope that helps.
To skip existing ones just use a look-behind - add (?<!href=") to the beginning of your regular expression, so it would look something like this:
/(?<!href=")http:\/\/\S*/
Obviously this isn't a complete solution for finding all types of URLs, but this should solve your problem of messing with existing ones.
if (preg_match('/\b(?<!=")(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|](?!.*".*>)(?!.*<\/a>)/i', $subject)) {
# Successful match
} else {
# Match attempt failed
}
Shameless plug: You can look here (regular expression replace a word by a link) for inspiration.
The question asked to replace some word with a certain link, unless there already was a link. So the problem you have is more or less the same thing.
All you need is a regex that matches a URL (in place of the word). The simplest assumption would be like this: a URL (optionally) starts with "http://", "ftp://" or "mailto:" and lasts as long as there are no white-space characters, line breaks, tag brackets or quotes.
Beware, long regex ahead. Apply case-insensitively.
(href\s*=\s*['"]?)?((?:http://|ftp://|mailto:)?[^.,<>"'\s\r\n\t]+(?:\.(?![.<>"'\s\r\n])[^.,!<>"'\s\r\n\t]+)+)
Be warned - this will also match URLs that are technically invalid, and it will recognize things.formatted.like.this as an URL. It depends on your data if it is too insensitive. I can fine-tune the regex if you have examples where it returns false positives.
The regex will produce two match groups. Group 2 will contain the matched thing, which is most likely a URL. Group 1 will contain either an empty string or 'href="'. You can use it as an indicator that this match occurred inside the href parameter of an existing link, so you don't have to touch that one.
Once you confirm that this does the right thing for you most of the time (with user supplied data, you can never be sure), you can do the rest in two steps, as I proposed it in the other question:
Make a link around every URL there is (unless there is something in match group 1!) This will produce double nested <a> tags for things that have a link already.
Scan for incorrectly nested <a> tags, removing the innermost one
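The first of these two steps could be sketched in PHP with preg_replace_callback, using group 1 to leave existing href values alone. The sample input is invented, and adding http:// to scheme-less matches is my own assumption; the second step (removing nested <a> tags when the link text is itself a URL) is not covered here.

```php
<?php
// The regex from above, unchanged, in a nowdoc to avoid quote escaping.
$pattern = <<<'RE'
~(href\s*=\s*['"]?)?((?:http://|ftp://|mailto:)?[^.,<>"'\s\r\n\t]+(?:\.(?![.<>"'\s\r\n])[^.,!<>"'\s\r\n\t]+)+)~i
RE;

$input = 'see www.example.com and <a href="http://example.org">this link</a>';

$output = preg_replace_callback($pattern, function (array $m): string {
    if ($m[1] !== '') {
        return $m[0];   // matched inside an existing href: leave untouched
    }
    // Assumption: scheme-less matches are web addresses.
    $href = preg_match('~^(?:http://|ftp://|mailto:)~i', $m[2]) ? $m[2] : 'http://' . $m[2];
    return '<a href="' . $href . '">' . $m[2] . '</a>';
}, $input);

echo $output, "\n";
```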