Regex in preg_replace to detect url format and extract elements

I need to replace certain user-entered URLs with embedded Flash objects, and I'm having trouble with the regex I'm using to match the URL, I think mainly because the URLs are SEO-friendly and therefore a bit harder to parse.
URL structure: http://www.site.com/item/item_title_that_can_include_1('_etc-32CHARACTERALPHANUMERICGUID
I need to both detect a URL in that format and capture the 32CHARACTERALPHANUMERICGUID, which is always placed after the - in the URL.
Something like this:
$ret = preg_replace('#http://www\.site\.com/item/([^-])-([a-zA-Z0-9]+)#','<embed>itemid=$2</embed>', $ret);
For some reason, the above does not find a match for a URL in the specified format. I'm new to regexes, so I think I'm missing something fairly obvious.

You should check out parse_url().
Examine the results - it was made for parsing URLs. You'll be able to extract the data you require from the tokens returned.
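As a rough sketch of the parse_url() route: the function splits the URL into its components, and the GUID can then be cut from the end of the path. The sample URL, its GUID value, and the variable names are made up for illustration, and the sketch assumes the GUID is everything after the last - in the path.

```php
<?php
// Hypothetical URL in the format from the question (title and GUID are made up).
$url = 'http://www.site.com/item/some_item_title-0123456789abcdef0123456789abcdef';

// parse_url() returns an array with keys such as 'scheme', 'host' and 'path'.
$parts = parse_url($url);
$path  = $parts['path'];

// Assumption: the GUID is everything after the last '-' in the path.
$guid = substr($path, strrpos($path, '-') + 1);

echo $guid, "\n";
```

This avoids a regex entirely, but it does not check that the GUID is 32 alphanumeric characters; that check would have to be added separately.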
If you are regex crazy, try this...
/^http:\/\/www\.site\.com\/item\/[^-]*\-([a-zA-Z0-9]{32})$/
Your example is almost there, but...
When you use a negated character class, i.e. [^-], you still need a quantifier. I used *, meaning 0 or more.
You don't seem to use the item title, so we won't bother capturing it.
You should use beginning (^) and end ($) anchors if the string is always exactly like that.
You say the GUID is 32 chars, so we may as well explicitly state that with the {32} quantifier.
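Put back into the preg_replace() call from the question, the points above might look like the following sketch. The sample text is made up. The ^ and $ anchors are left out on the assumption that the URL sits inside longer user text, and the \s inside the character class is an extra guard added here so the title part cannot run across whitespace.

```php
<?php
// Made-up sample text containing one URL in the format from the question.
$ret = "Look: http://www.site.com/item/some_title('_1-0123456789abcdef0123456789abcdef end";

// No ^/$ anchors, assuming the URL is embedded in longer text.
// [^-\s]* matches the title (no hyphens; \s is an added guard),
// and {32} pins the GUID to exactly 32 alphanumeric characters.
$pattern = '#http://www\.site\.com/item/[^-\s]*-([a-zA-Z0-9]{32})#';

// The title is not captured, so the GUID is group 1 rather than group 2.
$ret = preg_replace($pattern, '<embed>itemid=$1</embed>', $ret);

echo $ret, "\n";
```

If the whole string is always exactly one URL, the anchored pattern from the answer is the stricter choice.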

Related

How to remove backpath/parentpath from the URL?

Input:
http://foo/bar/baz/../../qux/
Desired Output:
http://foo/qux/
This can be achieved using a regular expression (unless someone can suggest a more efficient alternative).
If it were a forward lookup, it would be as simple as:
/\.\.\/[^\/]+/
However, I am not familiar with how to do a backward lookup for the first "/" (i.e. without resorting to /[a-z0-9-_]+\/\.\./).
One solution I thought of is to apply strrev, run the forward lookup regex (first example), and then apply strrev again, though I am sure there is a more efficient way.

Not the clearest question I've ever seen, but if I understand what you're asking, I think you only need to switch around what you have like this:
/[^\/]+/\.\./
...then replace that with a /.
Do that until no replacements are made and you should have what you want.
EDIT
Your attempt seems to try to match a forward slash / and two dots \.\. followed by a slash / (or \/ - they should both match the same thing), then one or more non-slash characters [^/]+, terminated by a slash /. Flipping it around, you want to find a slash followed by one or more non-slash characters and a terminating slash, then two dots and a final slash.
You may be confused into thinking that the regex engine parses and consumes things as it goes (so you wouldn't want to consume a directory name that is not followed by the correct number of dots), but that's not how it typically works - a regex engine matches the entire expression before it replaces or returns anything. So, you can have two dots followed by a directory name, or a directory name followed by two dots - it doesn't make a difference to the engine.
If your attempt is using the slash-enclosed Perl-style syntax, then you would of course need to use \/ for any slashes you're trying to match such as the middle one, but I would also recommend matching and replacing the enclosing slashes in the url as well: I think the PHP would be something like
preg_replace('/\/[^\/]+\/\.\.\//', '/', $input)
(??)
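The "repeat until no replacements are made" idea can be sketched as a small loop. The function name collapse_parent_segments is made up for this example, and # is used as the delimiter so the slashes need no escaping. It is only a sketch: it assumes there are never more ../ segments than directories to remove, since otherwise the pattern would start eating the host name or treat .. itself as a directory name.

```php
<?php
// Hypothetical helper: removes "dir/../" pairs one at a time until none are left.
function collapse_parent_segments(string $url): string
{
    do {
        // Replace a single "/name/../" with "/" and record how many replacements happened.
        $url = preg_replace('#/[^/]+/\.\./#', '/', $url, 1, $count);
    } while ($count > 0);

    return $url;
}

echo collapse_parent_segments('http://foo/bar/baz/../../qux/'), "\n";
```

Each pass removes the leftmost directory that is directly followed by ../, so nested cases such as bar/baz/../../ take two passes.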

Technically, what you want is to replace segments like '/path1/path2/../../' with '/'. To do that you need to match 'pathx/'^n '../'^n, which is definitely NOT a regular language (it is context-free). Most regex libraries do support some non-regular languages, though, and can (with a lot of effort) handle this kind of language.
An easy way to solve it is to stay within regular expressions and loop several times, replacing '/[^./]+/\.\./' with '/'.
If you still want to do it in a single step, lookahead and grouping are needed, but it will be hard to write (I'm not that used to them, but I will try).
EDIT:
I've found a solution using only one regex, but it requires PCRE:
([^/.]+/(?1)?\.\./)
I based my solution on the following link:
Match a^n b^n c^n (e.g. "aaabbbccc") using regular expressions (PCRE)
(Note that dots are "forbidden" in the first section, so you cannot have path.1/path.2/. Allowing them is quite a bit more complex, because you would have to admit dots while still forbidding '../' as a valid name in the first section.)
This subexpression admits path names like 'path1/':
[^/.]+/
This subexpression admits the double dots:
\.\./
You can test the regexp at
https://www.debuggex.com/
(remember to set it in PCRE mode)
Here is a working copy:
https://eval.in/52675
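For readers who cannot open the link, applying the recursive pattern in PHP might look like the sketch below. The variable names are made up, and the example URL is the one from the question. The match is replaced with an empty string because the pattern does not include the leading slash.

```php
<?php
$input = 'http://foo/bar/baz/../../qux/';

// (?1) recurses into group 1, so each "name/" must be balanced by one "../".
// No dots are allowed in directory names, as noted above.
$output = preg_replace('#([^/.]+/(?1)?\.\./)#', '', $input);

echo $output, "\n";
```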

multiple text replace inside a string, while keeping the selected variable

I have a web script that builds an HTML page in a PHP string, then delivers it to the user. All of the pages are generated by index.php, each with a unique URL:
domain.host.com/index.php?loadpage=/BLAH
The homepage is static HTML, but every other page is dynamically generated into this PHP string. (It may seem like I'm rambling; I'm just trying to give as much info as possible.) I have written some JavaScript to modify the link URL:
BLAH Link
This shows the nice neat link in the status bar, but the JavaScript sends the user to the URL I want. (I have no need to modify the URL bar, as this is in an iframe.)
These links are fine on the static page, but the dynamically generated page that lives in the PHP string is a little harder. I need to search through a string for every occurrence of:
href="?loadpage=/ [WILDCARD] " title=
and replace it with:
href="http://domain.com/ [WILDCARD] " onclick="location.href='?loadpage=/ [WILDCARD] '; return false;" title=
This seems very complicated to me. I think it could be done with ereg/preg match/replace, but I have no clue about regex.
In short, I need some way of searching through a PHP string that contains the full page HTML and replacing the first string with the second on every occurrence of a link with '?loadpage=/'. Each link will have a different [WILDCARD], so I'm presuming that the script will need to find every occurrence, save the [WILDCARD] to a variable, then do the replacement and insert the saved value into the new URL.
EDIT.
Just to clarify what the original link looks like:
<a id="random" href="?loadpage=/BLAH" title="BLAH Title"></a>
This is why I am only searching from the href attribute.

You are right, what you need is a regex (your need for a wildcard replace is the clue). This answer is not meant to be a complete solution; it just gives you an idea of how regexes work. I will leave it to you to integrate this with PHP (try preg_match_all).
This is the pattern you want to match:
"\?loadpage=\/([^"]*)"
The \ is an escape for characters that have special meaning in regexes.
So, ignoring the escapes, this is:
"?loadpage=/ //the start of the string up to the wildcard part
() // capturing parentheses, indicating a part that
// you want to access in the replace string
[^"]* // any number of occurences of any character that is NOT doublequote
// ^ is the negation symbol
// * indicates "zero or more occurrences"
followed by...
" doublequote character
Now you need a replacement string. For this you just need to know that your capturing parentheses allow you to recall that part of the match. In most regex flavours you can capture these to a series of numbered variables, usually represented as $1, $2, $3... or \1, \2, \3... In your case you only have one capture variable to deal with.
So your replacement string could look like
"http://domain.com/$1/" onclick="location.href='?loadpage=/$1'; return false"
In perl you would put the whole thing together like this:
$string =~ s|"\?loadpage=\/([^"]*)"|"http://domain.com/$1/" onclick=\"location.href='?loadpage=/$1'\; return false"|g;
Note that you don't need to escape your quotemarks. This may differ in PHP.
As you will see, it easily gets very cryptic. regular-expressions.info is a useful online reference.
Just so you know what you are looking at (you won't need to do this in PHP)...
=~ is the Perl regex operator (you won't use this in PHP; take a look at the preg_match documentation).
Then you have the form
s|match_pattern|replace_pattern|g;
where s indicates replacement (as opposed to simple matching),
g indicates global matching (otherwise the process stops at the first match), and
||| are the separators. These are usually written ///, but then you would have to escape all of the slashes in your URLs, which doubles the illegibility.
But this is now too much Perl-specific detail; read the PHP regex docs!
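For comparison, a PHP version of the same replacement might look like the sketch below. It is an illustration rather than a drop-in solution: the sample HTML is the link from the question, the variable names are made up, the pattern additionally includes href= so that only href attributes are touched, and the replacement follows the target format given in the question (no trailing slash after the wildcard).

```php
<?php
// Sample link taken from the question.
$html = '<a id="random" href="?loadpage=/BLAH" title="BLAH Title"></a>';

// # is used as the delimiter so the slashes need no escaping.
// ([^"]*) captures the wildcard part up to the closing double quote.
$pattern = '#href="\?loadpage=/([^"]*)"#';

// $1 recalls the captured wildcard; single quotes keep PHP from interpolating it.
$replacement = 'href="http://domain.com/$1" onclick="location.href=\'?loadpage=/$1\'; return false;"';

// preg_replace() replaces every match by default, so no "g" flag is needed.
$out = preg_replace($pattern, $replacement, $html);

echo $out, "\n";
```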

Rapidshare URL not matching correctly

I'm trying to make sure that a Rapidshare URL is valid when a user submits it through my form.
This is the regex that I've come up with so far:
http://rapidshare.com/files/[0-9]+/[a-zA-Z0-9\._-]+
A rapidshare link looks like this:
http://rapidshare.com/files/168501977/some_random-file.zip
My pattern matches, but not entirely correctly. For example, if we use this input:
http://rapidshare.com/files/168501977/some_random-file.zip£%^$
It will still match using the PHP function preg_match(), and let it go through, even though there are illegal symbols on the end of the URL. I want the pattern to match the entire input, and not just a random length that matches.
Any help would be appreciated, cheers!

You need to anchor the regex pattern. Use ^ to anchor the beginning and $ to anchor the end. So the pattern becomes:
^http://rapidshare.com/files/[0-9]+/[a-zA-Z0-9\._-]+$
This prevents a partial match of the string like the example is generating.
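A quick way to see the difference is to run the anchored pattern against both inputs from the question. In this sketch the variable names are made up, # is used as the delimiter, and the dots in the host name are escaped, which is a small tightening not present in the pattern above.

```php
<?php
// Anchored pattern: ^ and $ force the whole input to match.
$pattern = '#^http://rapidshare\.com/files/[0-9]+/[a-zA-Z0-9._-]+$#';

$good = 'http://rapidshare.com/files/168501977/some_random-file.zip';
$bad  = 'http://rapidshare.com/files/168501977/some_random-file.zip£%^$';

var_dump(preg_match($pattern, $good)); // int(1)
var_dump(preg_match($pattern, $bad));  // int(0)
```

One caveat: in PCRE, $ also matches just before a trailing newline, so the D modifier (or \z) is worth considering if a trailing newline must be rejected too.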

Validate the start and the end of your string using ^ and $. Example:
^ht{2}p:\/{2}rapidshare\.com\/files\/\d+\/[\.a-zA-Z0-9_-]+$

Need a good regex to convert URLs to links but leave existing links alone

I have a load of user-submitted content. It is HTML, and may contain URLs. Some of them will be <a>'s already (if the user is good) but sometimes users are lazy and just type www.something.com or at best http://www.something.com.
I can't find a decent regex to capture URLs but ignore ones that are immediately to the right of either a double quote or '>'. Anyone got one?

Jan Goyvaerts, creator of RegexBuddy, has written a response to Jeff Atwood's blog that addresses the issues Jeff had and provides a nice solution.
\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
In order to ignore matches that occur right next to a " or >, you could add (?<![">]) to the start of the regex, so you get
(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
This will match full addresses (http://...) and addresses that start with www. or ftp. - you're out of luck with addresses like ars.userfriendly.org...
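As a sketch of how the lookbehind version could be used from PHP: the sample text and variable names below are made up, the character class is written with @ as in the commonly published form of this pattern, and braces are used as delimiters because the pattern itself contains /, #, ~ and |. The i modifier is needed since the character classes only list upper-case letters.

```php
<?php
// Made-up sample: one bare URL and one URL that is already a link.
$text = 'Visit http://example.com/page or <a href="http://example.org/">http://example.org/</a>';

// Braces as delimiters; (?<![">]) skips URLs directly preceded by " or >.
$pattern = '{(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]}i';

// $0 is the whole match.
$linked = preg_replace($pattern, '<a href="$0">$0</a>', $text);

echo $linked, "\n";
```

The lookbehind only inspects the single character before the URL, so a URL inside existing link text that does not start right after the > would still be wrapped a second time. Matches that start with www. would also need a scheme prepended before being used as an href.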

This thread is old as the hills, but I came across it while working on my own problem: That is, convert any urls into links, but leave alone any that are already within anchor tags. After a while, this is what has popped out:
(?!(?!.*?<a)[^<]*<\/a>)(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]
With the following input:
http://www.google.com
http://google.com
www.google.com
<p>http://www.google.com<p>
this is a normal sentence. let's hope it's ok.
www.google.com
This is the output of a preg_replace:
http://www.google.com
http://google.com
www.google.com
<p>http://www.google.com<p>
this is a normal sentence. let's hope it's ok.
www.google.com
Just wanted to contribute back to save somebody some time.

I made a slight modification to the Regex contained in the original answer:
(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&@/%=~_|$?!:,.]*[A-Z0-9+&@/%=~_|$]
which allows for more subdomains, and also runs a fuller check on tags. To apply this with PHP's preg_replace, you can use:
$convertedText = preg_replace( '#(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&@/%=~_|$?!:,.]*[A-Z0-9+&@/%=~_|$]#i', '<a href="\0" target="_blank">\0</a>', $originalText );
Note, I removed # from the regex, in order to use it as a delimiter for preg_replace. It's pretty rare that # would be used in a URL anyway.
Obviously, you can modify the replacement text, and remove target="_blank", or add rel="nofollow" etc.
Hope that helps.

To skip existing ones just use a look-behind - add (?<!href=") to the beginning of your regular expression, so it would look something like this:
/(?<!href=")http:\/\/\S*/
Obviously this isn't a complete solution for finding all types of URLs, but this should solve your problem of messing with existing ones.

if (preg_match('/\b(?<!=")(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|](?!.*".*>)(?!.*<\/a>)/i', $subject)) {
    # Successful match
} else {
    # Match attempt failed
}

Shameless plug: You can look here (regular expression replace a word by a link) for inspiration.
The question asked to replace some word with a certain link, unless there already was a link. So the problem you have is more or less the same thing.
All you need is a regex that matches a URL (in place of the word). The simplest assumption would be like this: a URL (optionally) starts with "http://", "ftp://" or "mailto:" and lasts as long as there are no white-space characters, line breaks, tag brackets or quotes.
Beware, long regex ahead. Apply case-insensitively.
(href\s*=\s*['"]?)?((?:http://|ftp://|mailto:)?[^.,<>"'\s\r\n\t]+(?:\.(?![.<>"'\s\r\n])[^.,!<>"'\s\r\n\t]+)+)
Be warned - this will also match URLs that are technically invalid, and it will recognize things.formatted.like.this as a URL. Whether that is too permissive depends on your data. I can fine-tune the regex if you have examples where it returns false positives.
The regex will produce two match groups. Group 2 will contain the matched thing, which is most likely a URL. Group 1 will contain either an empty string or an 'href="'. You can use it as an indicator that this match occurred inside an href attribute of an existing link, so you don't have to touch that one.
Once you confirm that this does the right thing for you most of the time (with user supplied data, you can never be sure), you can do the rest in two steps, as I proposed it in the other question:
Make a link around every URL there is (unless there is something in match group 1!) This will produce double nested <a> tags for things that have a link already.
Scan for incorrectly nested <a> tags, removing the innermost one

Regular Expression to Detect a Specific Query

I wonder if anyone can construct a regular expression that can detect whether a person searches for something like "site:cnn.com" or "site:www.globe.com.ph/". I've been having the most difficult time figuring it out. Thanks a lot in advance!
Edit: Sorry forgot to mention my script is in PHP.

OK, for input into an arbitrary text field, something as simple as the following will work:
\bsite:(\S+)
where the parentheses will capture whatever site/domain they're trying to search. It won't verify it as valid, but validating urls/domains is complex and there are many easily googlable regexes for doing that, for instance, there's one here.
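In PHP this could be wrapped in a small helper, sketched below. The function name extract_site_query is made up for the example; it returns the captured site, or null when the input is not a site: query and a regular web search should be done instead.

```php
<?php
// Hypothetical helper: returns the site/domain from a "site:..." query, or null.
function extract_site_query(string $query): ?string
{
    // \b keeps words such as "website:" from matching; (\S+) captures up to the next whitespace.
    if (preg_match('/\bsite:(\S+)/', $query, $m)) {
        return $m[1];
    }

    return null;
}

var_dump(extract_site_query('site:cnn.com'));
var_dump(extract_site_query('site:www.globe.com.ph/'));
var_dump(extract_site_query('plain web search'));
```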

What are you matching against? A referer url?
Assuming you're matching against a referer url that looks like this:
http://www.google.com/search?client=safari&rls=en-us&q=whatever+site:foo.com&ie=UTF-8&oe=UTF-8
A regex like this should do the trick:
\bsite(?:\:|%3[aA])(?:(?!(?:%20|\+|&|$)).)+
Notes:
The colon after 'site' can either be unencoded or percent-encoded. Most user agents will leave it unencoded (which I believe is actually contrary to the standard), but this will handle both.
I assumed the site:... value would be right-bounded by the equivalent of a space character, end of field (&) or end of string ($).
I didn't assume either x-www-form-urlencoded encoding (space == '+') or percent encoding (space == %20); this will handle both.
The (?:...) is a non-capturing group. (?!...) is a negative lookahead.
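Applied to the sample referer URL in PHP, the pattern could be used as in the sketch below. The variable names are made up, and one capturing group has been added around the site value so it can be read from the match; that group is an addition for this example, not part of the pattern above.

```php
<?php
// Sample referer URL from the answer above.
$referer = 'http://www.google.com/search?client=safari&rls=en-us&q=whatever+site:foo.com&ie=UTF-8&oe=UTF-8';

// Same pattern, plus one added capturing group around the site value.
$pattern = '/\bsite(?:\:|%3[aA])((?:(?!(?:%20|\+|&|$)).)+)/';

$site = preg_match($pattern, $referer, $m) ? $m[1] : null;

echo $site, "\n";
```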

No, it's not for a referrer URL. My PHP script basically spits out information about a domain (e.g. backlinks, PageRank, etc.) and I need that regex so it will know what the user is searching for. If the user enters something that doesn't match the regex, it does a regular web search instead.

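If this is all you are trying to do, I guess I'd take the simpler approach and just do:
$entry = $_REQUEST['q'];
// explode() is used because split() was removed in PHP 7;
// the limit of 2 keeps any later colons inside the site value.
$tokens = explode(':', trim($entry), 2);
if (1 < count($tokens) && strtolower($tokens[0]) == 'site') {
    $site = $tokens[1];
}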
