PHP URL to Link with Regex


I know I've seen this done in a lot of places, but I need something a little different from the norm. Sadly, when I search for this it gets buried in posts about simply turning the link into an HTML anchor tag. I want the PHP function to strip the "http://" or "https://" from the link, as well as everything after the domain, so basically I am looking to turn A into B.
A: http://www.youtube.com/watch?v=spsnQWtsUFM
B: www.youtube.com
If it helps, here is my current PHP regex replace function.
ereg_replace("[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]", "\\0", htmlspecialchars($body, ENT_QUOTES)));
It would probably also be helpful to say that I have absolutely no understanding in regular expressions. Thanks!
EDIT: When I entered a comment like this: blahblah https://www.facebook.com/?sk=ff&ap=1 blah I get HTML like this: <a class="bwl" href="blahblah https://www.facebook.com/?sk=ff&ap=1 blah">www.facebook.com</a> which doesn't work at all, because it takes the text around the link with it. It works great if someone comments only a link, however. This happened after I changed the function to this:
preg_replace("#^(.*)//(.*)/(.*)$#",'<a class="bwl" href="\0">\2</a>', htmlspecialchars($body, ENT_QUOTES));

This is the simplest and cleanest way:
$str = 'http://www.youtube.com/watch?v=spsnQWtsUFM';
preg_match("#//(.+?)/#", $str, $matches);
$site_url = $matches[1];
EDIT: I assume that $str has already been checked to be a URL, so I left that out. I also assume that every URL contains either 'http://' or 'https://'. If the URL is formatted like www.youtube.com/watch?v=spsnQWtsUFM or even youtube.com/watch?v=spsnQWtsUFM, the above regexp won't work!
EDIT2: I'm sorry, I didn't realize that you were trying to replace every URL in a whole text. In that case, this should work the way you want it:
$str = preg_replace('#(\A|[^=\]\'"a-zA-Z0-9])(http[s]?://(.+?)/[^()<>\s]+)#i', '\\1\\3', $str);

I am not a regex whizz either,
^(.*)//(.*)/(.*)$
\2
was what worked for me when I tried to use as find and replace in programmer's notepad.
^(.*)// should capture the protocol, referred to as \1 in the second line.
(.*)/ should capture everything from there up to the last / (the host, when the path contains no further slashes), referred to as \2 in the second line.
(.*)$ captures everything up to the end of the string, referred to as \3 in the second line.
Added later
^(.*)( )(.*)//(.*)/(.*)( )(.*)$
\1\2\4 \7
This should be a bit better, but will only replace just 1 URL

The \0 is replaced by the entire matched string, whereas \x (where x is a number other than 0 starting at 1) will be replaced by each subpart of your matched string based on what you wrap in parentheses and the order those groups appear. Your solution is as follows:
ereg_replace("[[:alpha:]]+://([^<>[:space:]]+[:alnum:]*)[[:alnum:]/]", "\\1
I haven't been able to test this though so let me know if it works.

I think this should do it (I haven't tested it):
preg_match('/^http[s]?:\/\/(.+?)\/.*/i', $main_url, $matches);
$final_url = ''.$matches[1].'';

I'm surprised no one remembers PHP's parse_url function:
$url = 'http://www.youtube.com/watch?v=spsnQWtsUFM';
echo parse_url($url, PHP_URL_HOST); // displays "www.youtube.com"
I think you know what to do from there.
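For a URL embedded in surrounding text, one way to combine parse_url with a regex is sketched below. It is only an illustration: the function name linkify_hosts and the URL pattern are assumptions, not code from the question, and the pattern only recognises URLs that start with http:// or https://.

```php
<?php
// Hypothetical helper: turn every http(s) URL in $body into a link
// whose visible text is just the host name.
function linkify_hosts(string $body): string
{
    // Escape first so user text cannot inject markup.
    $escaped = htmlspecialchars($body, ENT_QUOTES);

    $result = preg_replace_callback(
        '#\bhttps?://[^\s<>]+#i',
        function (array $m): string {
            // $m[0] is already escaped; decode it before parsing.
            $url  = htmlspecialchars_decode($m[0], ENT_QUOTES);
            $host = parse_url($url, PHP_URL_HOST);
            if (!is_string($host)) {
                return $m[0]; // leave unparseable matches untouched
            }
            return '<a class="bwl" href="' . htmlspecialchars($url, ENT_QUOTES) . '">'
                 . htmlspecialchars($host, ENT_QUOTES) . '</a>';
        },
        $escaped
    );

    // preg_replace_callback returns null on a regex error; fall back to the escaped text.
    return $result ?? $escaped;
}

echo linkify_hosts('blahblah https://www.facebook.com/?sk=ff&ap=1 blah');
// prints: blahblah <a class="bwl" href="https://www.facebook.com/?sk=ff&amp;ap=1">www.facebook.com</a> blah
```

Because the callback only ever sees the matched URL, the text around the link stays outside the href.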

$result = preg_replace('%(http[s]?://)(\S+)%', '\2', $subject);

The code with regex does not work completely.
I made this code. It is much more comprehensive, but it works:
See the result here: http://cht.dk/data/php-scripts/inc_functions_links.php
See the source code here: http://cht.dk/data/php-scripts/inc_functions_links.txt

Related

Replace many code lines in PHP between tags

I have a PHP page with this line:
$url = file_get_contents('http://web.com/rss.php');
Now I want to replace this:
<link>http://web.com/download/45212/lorem-ipsum</link>
<link>http://web.com/download/34210/dolor-sit</link>
<link>http://web.com/download/78954/consectetur-adipiscing</link>
<link>http://web.com/download/77741/laboris-nisi</link>...
With this:
<link>http://otherweb.com/get-d/45212</link>
<link>http://otherweb.com/get-d/34210</link>
<link>http://otherweb.com/get-d/78954</link>
<link>http://otherweb.com/get-d/77741</link>...
I have replaced one part with str_replace, but I don't know how to replace the other part.
This is what I have done so far:
$url = str_replace('<link>http://web.com/download/','<link>http://otherweb.com/get-d/', $url);

You can do this all with a single line of regex :)
Regex
The below regex will detect your middle numbered section....
<link>http:\/\/web\.com\/download\/(.*?)\/.*?<\/link>
PHP
To use this inside PHP you could use this line of code
$url = preg_replace("/<link>http:\/\/web\.com\/download\/(.*?)\/.*?<\/link>/m", "<link>http://otherweb.com/get-d/$1</link>", $url);
This should do exactly what you need!
Explanation
The way it works is preg_replace looks for <link>http://web.com/download/ at the start and /{something}</link> at the end. It captures the middle area into $1
So when we run preg_replace ($pattern, $replacement, $subject) we tell PHP to just find that middle part (the numbers in your URLS) and embed them into "<link>http://otherweb.com/get-d/$1</link>".
I tested it and it seems to be working :)
Edit: I would propose this answer as best for you as it does everything with a single line, and does not require any str_replace. My answer also will function even if the middle section is alphanumeric, and not only if it is numeric.
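As a quick check of the same idea, here is a small sketch that wraps the replacement in a function and uses # as the delimiter so the slashes need no escaping. The function name rewrite_links is hypothetical, and the [^/<]+ and [^<]* classes are a slightly stricter variant of the .*? in the pattern above.

```php
<?php
// Hypothetical helper: keep the ID segment after /download/ and drop the slug.
function rewrite_links(string $feed): string
{
    $result = preg_replace(
        '#<link>http://web\.com/download/([^/<]+)/[^<]*</link>#',
        '<link>http://otherweb.com/get-d/$1</link>',
        $feed
    );

    // preg_replace returns null on a regex error; fall back to the input.
    return $result ?? $feed;
}

$feed = "<link>http://web.com/download/45212/lorem-ipsum</link>\n"
      . "<link>http://web.com/download/34210/dolor-sit</link>";

echo rewrite_links($feed);
// prints:
// <link>http://otherweb.com/get-d/45212</link>
// <link>http://otherweb.com/get-d/34210</link>
```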

All you want to do is:
extract the relevant data e.g. the five digit number
put the extracted part into a new context
$input = 'http://web.com/download/45212/lorem-ipsum';
echo preg_replace('/.*\/(\d+).*/', 'http://otherweb.com/get-d/$1', $input);
To extract the relevant part, you can use (\d+) which means: find one or more digits, the parentheses make this a matching group, so you can access this value via $1.
To match and replace the whole line, you have to augment the pattern with .* (which means, find any number of any character) before and after the (\d+) part.
With this set up, the whole string matches, so the whole string will be replaced.

You should replace the initial part of link with a token, and then preg_replace the end of your string searching for the first / and replacing with the </link>. And so you replace your token with the initial part you desire.
$url = str_replace('<link>http://web.com/download/','init', $url);
$url = preg_replace("/\/.+/", "</link>", $url);
$url = str_replace('init', '<link>http://otherweb.com/get-d/', $url);

You're just missing a simple regex to clean up the last part.
Here's how I did it:
$messed_up = '
<link>http://web.com/download/45212/lorem-ipsum</link>
<link>http://web.com/download/34210/dolor-sit</link>
<link>http://web.com/download/78954/consectetur-adipiscing</link>
<link>http://web.com/download/77741/laboris-nisi</link>';
// Firstly we can clean up the first part (like you did) with str_replace
$clean = str_replace('web.com/download/','otherweb.com/get-d/', $messed_up);
// After that we'll use preg_replace to get rid of the last part
$clean = preg_replace("/(.+\/\d+)\/.*(<.*)/", "$1$2", $clean);
echo $clean; // echo rather than printf, which would treat any % in the text as a format specifier
/* Prints:
<link>http://otherweb.com/get-d/45212</link>
<link>http://otherweb.com/get-d/34210</link>
<link>http://otherweb.com/get-d/78954</link>
<link>http://otherweb.com/get-d/77741</link>
*/
I made this quickly so there might be some room for improvement but it definitely works.
You can check out the code in practice HERE.
If you're interested in learning PHP RegEx This is a great place to practice.

How to get a number from a html source page?

I'm trying to retrieve the followed by count on my instagram page. I can't seem to get the Regex right and would very much appreciate some help.
Here's what I'm looking for:
y":{"count":
That's the beginning of the string, and I want the 4 numbers after that.
$string = preg_replace("{y"\"count":([0-9]+)\}","",$code);
Someone suggested this ^ but I can't get the formatting right...

You haven't posted your strings, so what the regex should be is a guess. Instead, I'll explain why your code fails.
preg_replace('"followed_by":{"count":\d')
This is very far from the correct preg_replace usage. You need to give it the replacement string and the string to search on. See http://php.net/manual/en/function.preg-replace.php
Your second usage:
$string = preg_replace(/^y":{"count[0-9]/","",$code);
This is closer, but preg_replace is global, so it searches your whole file (or it would if not for the anchor) and replaces the found value with nothing. What you really want (I think) is preg_match.
$string = preg_match('/y":\{"count(\d{4})/', $code, $match);
$counted = $match[1];
This presumes your regex was kind of correct already.
Per your update:
Demo: https://regex101.com/r/aR2iU2/1
$code = 'y":{"count:1234';
$string = preg_match('/y":\{"count:(\d{4})/', $code, $match);
$counted = $match[1];
echo $counted;
PHP Demo: https://eval.in/489436
I removed the ^, which requires the match to begin at the start of your string, escaped the {, and made the \d match exactly 4 digits. The () is a capture group and stores whatever is found inside of it, in this case the 4 numbers.
Also if this isn't just for learning you should be prepared for this to stop working at some point as the service provider may change the format. The API is a safer route to go.

This regexp should capture value you're looking for in the first group:
\{"count":([0-9]+)\}
Use it with preg_match_all function to easily capture what you want into array (you're using preg_replace which isn't for retrieving data but for... well replacing it).
Your regexp isn't working because you didn't escape the curly brackets. You also didn't put a quantifier after the digit class (the plus sign in my example), so it would only capture the first digit anyway.
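Put together in PHP, a minimal sketch could look like the following. The sample string and the function name followed_by_count are assumptions for illustration; the real page source may be laid out differently.

```php
<?php
// Hypothetical helper: pull the number out of "followed_by":{"count":N}.
function followed_by_count(string $code): ?int
{
    // The braces are escaped; (\d+) captures one or more digits.
    if (preg_match('/"followed_by":\{"count":(\d+)\}/', $code, $match)) {
        return (int) $match[1];
    }
    return null; // pattern not found
}

// Assumed sample input, not taken from a real page.
$code = '{"followed_by":{"count":1234},"follows":{"count":56}}';
var_dump(followed_by_count($code)); // int(1234)
```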

preg_replace regex tags not being replaced

Hoping you can help. I'm pretty new to regex, and although I have written this regex it doesn't seem to match. I don't receive an error message, so I'm assuming the syntax is correct but it's just not being applied?
I want the regex to replace content like
{foo}bar{/foo} with
bar
Here is my code:
$regex = "#([{].*?[}])(.*?)([{]/.*?[}])#e";
$return = preg_replace($regex,"('$2')",$return);
Hope someone can help. Not sure why it doesn't seem to work.
Thanks for reading.

Your regex does work; however, it isn't smart enough to know that the end tag has to be the same as the start tag. I would use this instead. I've also simplified it a little:
$regex = '#{([^}]*)}(.*?)\{/\\1}#';
$str = '{foo}bar{/foo}';
echo preg_replace($regex, '$2', $str); // outputs "bar"

Referring to my comment above:
#(?:[{](.*?)[}])(.*?)(?:[{]/\1[}])#
uses a backreference to keep the tags equal. Also, I used non-capture parentheses to keep the useless groups out: $1 will be the tag name, and $2 will be the tag content.
Note that you will have to apply the replacement several times if your tags can nest.
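For nested tags, one possible approach is to repeat the replacement until nothing changes. The sketch below assumes a hypothetical function name strip_curly_tags and adds the s modifier so tag content may span lines.

```php
<?php
// Hypothetical helper: remove {tag}...{/tag} wrappers, including nested ones.
function strip_curly_tags(string $text): string
{
    // \1 is the backreference that forces the closing tag to match the opening one.
    $pattern = '#\{([^{}/]+)\}(.*?)\{/\1\}#s';

    do {
        $replaced = preg_replace($pattern, '$2', $text, -1, $count);
        if ($replaced === null) {
            break; // regex error: keep the last good value
        }
        $text = $replaced;
    } while ($count > 0);

    return $text;
}

echo strip_curly_tags('{a}x{b}y{/b}z{/a}'); // prints "xyz"
```

Each pass unwraps the outermost matching pair, so the loop runs about once per nesting level.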

Regular expression anchor text for a link

I am trying to pull the anchor text from a link that is formatted this way:
<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>
I want only the anchor text for the link : "i_want_this"
"variable_text" varies according to the filename so I need to ignore that.
I am using this regex:
<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>
This is matching of course the complete link.

PHP's preg_* functions use PCRE (Perl Compatible Regular Expressions), which is very close to Perl's own regex syntax. If you want to learn a lot about regex, visit perlretut.org. Also, look into regex tools like Expresso.
For your use, know that regex quantifiers are greedy by default. That means that when you ask for something, followed by anything (any number of repetitions), followed by something else, the "anything" matches as much as it can: it runs to the last occurrence of that second something rather than stopping at the first.
To be more clear, what you want is this:
<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>
Beyond that, you want to capture the second group of "any character, any number of times". You can do that using what are called capture groups (anything inside parentheses is captured as a group for reference later, also called back references).
I would also look into named subpatterns. With those, you can reference your choice with a human-readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern), where name is the name you want and pattern is the actual regex. I'll use that below.
So all that being said, here's the "lazy web" for your regex:
<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);
print $matches['target'];
?>
//This should output "i_want_this"
Oh, and one final thought. Depending on what exactly you are doing, you may want to look into SimpleXML instead of using regex for this. This would probably require that the tags we see are just snippets of a larger whole, as SimpleXML requires well-formed XML (or XHTML).

I'm sure someone will probably have a more elegant solution, but I think this will do what you want.
Where:
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
Option 1:
$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);
Option 2:
$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2);
print($matches2[4]);

Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.
$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);
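As a sketch of the DOM-parser route in PHP, the snippet below uses the built-in DOMDocument class. The function name anchor_texts is hypothetical, and the sample markup assumes the link points at /en/browse/file/variable_text as in the question's regex.

```php
<?php
// Hypothetical helper: return the text of every <a> whose href starts with $hrefPrefix.
function anchor_texts(string $html, string $hrefPrefix): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (strpos($a->getAttribute('href'), $hrefPrefix) === 0) {
            $texts[] = $a->textContent;
        }
    }
    return $texts;
}

// Assumed sample markup based on the question.
$html = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
print_r(anchor_texts($html, '/en/browse/file/')); // the array holds one entry: i_want_this
```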

The thing to remember is that a regex returns everything you searched for if it matches. You need to specify that you only care about the part you've surrounded with parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs the whole match, i.e. the complete link
If you specify what you want in parentheses, you can reference it:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'
Perl will have you use $1 instead of [1] like this:
$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";
Hope that helps.

I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.
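#<a href="/en/browse/file/[^"]*">(.*?)</a>#
I used # as a delimiter as it made it clearer. It'll also help if you put them in single quotes instead of double quotes so you don't have to escape anything at all.
If you want to limit to numbers instead, you can use:
#<a href="/en/browse/file/[0-9]+">(.*?)</a>#
If it should have just 5 numbers:
#<a href="/en/browse/file/[0-9]{5}">(.*?)</a>#
If it should have between 3 and 6 numbers:
#<a href="/en/browse/file/[0-9]{3,6}">(.*?)</a>#
If it should have more than 2 numbers:
#<a href="/en/browse/file/[0-9]{3,}">(.*?)</a>#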

This should work:
<a href="[^"]*">([^<]*)
This says: take EVERYTHING you find until you meet "
[^"]*
Same again: take everything until you meet <
[^<]*
The parentheses around [^<]*
([^<]*)
group it, so you can collect that data in PHP! If you look at preg_match in the PHP manual, you will see many fine examples there.
Good luck!
And for your concrete example:
<a href="/en/browse/file/variable_text">([^<]*)
I use
[^<]*
because in some cases
.*?
can be extremely slow. You shouldn't use it if you can use
[^<]*

You should use the tool Expresso for creating regular expression... Pretty handy..
http://www.ultrapico.com/Expresso.htm

PHP if string contains URL isolate it

In PHP, I need to be able to figure out if a string contains a URL. If there is a URL, I need to isolate it as another separate string.
For example: "SESAC showin the Love! http://twitpic.com/1uk7fi"
I need to be able to isolate the URL in that string into a new string. At the same time the URL needs to be kept intact in the original string. Follow?
I know this is probably really simple but it's killing me.

Something like
preg_match('/[a-zA-Z]+:\/\/[0-9a-zA-Z;.\/?:#=_#&%~,+$]+/', $string, $matches);
$matches[0] will hold the result.
(Note: this regex is certainly not RFC compliant; it may fetch malformed (per the spec) URLs. See http://www.faqs.org/rfcs/rfc1738.html).

This doesn't account for dashes (-), so I needed to add \- to the character class:
preg_match('/[a-zA-Z]+:\/\/[0-9a-zA-Z;.\/\-?:#=_#&%~,+$]+/', $_POST['string'], $matches);

URLs can't contain spaces, so...
\b(?:https?|ftp)://\S+
Should match any URL-like thing in a string.
The above is the pure regex. PHP preg_* and string escaping rules apply before you can use it.
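Wrapped for PHP, that could look like the sketch below. The function name first_url is hypothetical; # is used as the delimiter so the slashes need no escaping, and the original string is left untouched because preg_match only reads it.

```php
<?php
// Hypothetical helper: return the first URL-like substring, or null if there is none.
function first_url(string $text): ?string
{
    if (preg_match('#\b(?:https?|ftp)://\S+#i', $text, $matches)) {
        return $matches[0];
    }
    return null;
}

$tweet = 'SESAC showin the Love! http://twitpic.com/1uk7fi';
$url   = first_url($tweet);

echo $url, "\n";   // prints http://twitpic.com/1uk7fi
echo $tweet, "\n"; // the original string is unchanged
```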

$test = "SESAC showin the Love! http://twitpic.com/1uk7fi";
$myURL= strstr ($test, "http");
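echo $myURL; // prints http://twitpic.com/1uk7fi (strstr returns everything from "http" to the end of the string, so this only works when the URL comes last)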
