Parsing link from javascript function - php

I'm trying to parse a direct link out of a javascript function within a page. I'm able to parse the html info I need, but am stumped on the javascript part. Is this something that is achievable with php and possibly regex?
function videoPoster() {
document.getElementById("html5_vid").innerHTML =
"<video x-webkit-airplay='allow' id='html5_video' style='margin-top:"
+ style_padding
+ "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' "
+ "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}
What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!

/http://.*\.mp4/ will give you all characters between http:// and .mp4, inclusive.
See it in action.
If you need the session id, use something like /http://.*\.mp4?sessionid=\d+/

In general, no. Nothing short of a full javascript parser will always extract urls, and even then you'll have trouble with urls that are computed nontrivially.
In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:
['"](http://[^'"]*)['"]
If you have to enter that regexp as a string, beware of escaping.
If you ever have unescaped quotation marks in urls, this will fail. That's valid but rare. Whoever is writing the stuff you're parsing is unlikely to use them because they make referring to the urls in javascript a pain.

For your specific case, this should work, provided that none of the characters in the URL are escaped.
preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];
See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4?).
(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)

The following captures any url in your html
$matches=array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)){
print_r($matches['urls']);
}
if you want to do the same in javascript you could use this:
var matches;
if (matches=html.match(/src=["'](https?:\/\/[^"']+)["']/g)){
//gives you all matches, but they are still including the src=" and " parts, so you would
//have to run every match again against the regex without the g modifier
}

Related

Trying to stop regex at a tag

I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#[]#!$&'()*+,;=]*
Some of these characters are only allowed in specific places (such as ?) so if you want better validation you will need more cleverness
Instead of \S exclude the open tag char from the class:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
You could use this one as presented on the:
http://regex101.com/r/zV1uI7
On the bottom of the site you got it explained step by step.

PHP Regex URL parsing issues preg_replace

I have a custom markup parsing function that has been working very well for many years. I recently discovered a bug that I hadn't noticed before and I haven't been able to fix it. If anyone can help me with this that'd be awesome. So I have a custom built forum and text based MMORPG and every input is sanitized and parsed for bbcode like markup. It'll also parse out URL's and make them into legit links that go to an exit page with a disclaimer that you're leaving the site... So the issue that I'm having is that when I user posts multiple URL's in a text box (let's say \n delimited) it'll only convert every other URL into a link. Here's the parser for URL's:
$markup = preg_replace("/(^|[^=\"\/])\b((\w+:\/\/|www\.)[^\s<]+)" . "((\W+|\b)([\s<]|$))/ei", '"$1".shortURL("$2")."$4"', $markup);
As you can see it calls a PHP function, but that's not the issue here. Then entire text block is passed into this preg_replace at the same time rather than line by line or any other means.
If there's a simpler way of writing this preg_replace, please let me know
If you can figure out why this is only parsing every other URL, that's my ultimate goal here
Example INPUT:
http://skylnk.co/tRRTnb
http://skylnk.co/hkIJBT
http://skylnk.co/vUMGQo
http://skylnk.co/USOLfW
http://skylnk.co/BPlaJl
http://skylnk.co/tqcPbL
http://skylnk.co/jJTjRs
http://skylnk.co/itmhJs
http://skylnk.co/llUBAR
http://skylnk.co/XDJZxD
Example OUTPUT:
http://skylnk.co/tRRTnb
<br>http://skylnk.co/hkIJBT
<br>http://skylnk.co/vUMGQo
<br>http://skylnk.co/USOLfW
<br>http://skylnk.co/BPlaJl
<br>http://skylnk.co/tqcPbL
<br>http://skylnk.co/jJTjRs
<br>http://skylnk.co/itmhJs
<br>http://skylnk.co/llUBAR
<br>http://skylnk.co/XDJZxD
<br>
e flag in preg_replace is deprecated. You can use preg_replace_callback to access the same functionality.
i flag is useless here, since \w already matches both upper case and lower case, and there is no backreference in your pattern.
I set m flag, which makes the ^ and $ matches the beginning and the end of a line, rather than the beginning and the end of the entire string. This should fix your weird problem of matching every other line.
I also make some of the groups non-capturing (?:pattern) - since the bigger capturing groups have captured the text already.
The code below is not tested. I only tested the regex on regex tester.
preg_replace_callback(
"/(^|[^=\"\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))/m",
function ($m) {
return "$m[1]".shortURL($m[2])."$m[3]";
},
$markup
);

URL detection problems

i'm currently having some problems with detecting urls and making them clickable.
Until now it always worked fine, probably because we always tested this with real urls, but now the website is live, we're having some problems.
This was the code we used to detect them before
$content = preg_replace('!(((f|ht)tp://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $content);
$content = eregi_replace('([[:space:]()[{}])(www.[-a-zA-Z0-9#:%_\+.~#?&//=]+)', '\\1\\2', $content);
It was doing a great job for normal urls, but some urls are giving problems:
- hk.linkedin.com
- www.test.com
- test.com
Also notice that some urls don't have http in fron of them.
I'm really not that good with regex, so I would very much appreciate it if somebody could help me figure this out.
What exactly you wanted to get. In this example, I can see blatant lack of understanding for regular expressions... but then, I see this exact code used in few codes according to Google Code Search. But those were made to find URLs in middle of text (not always what looks like URL is URL, but if it contains http:// or www it's sure that's URL.
Not everything needs to be done only using regular expressions. Those are helpful, but sometimes they make additional problems.
One of problems in regular expressions is that they don't have conditionals on result. You can use multiple regular expressions, but there is chance that something will be done wrongly (like affecting what previous regular expression has done). Just look at this. It assigns additional function (you can use e modifier, but it may make code unreadable).
<?php
$content = preg_replace_callback('{\b(?:(https?|ftp)://)?(\S+[.]\S+)\b}i',
'addHTTP', $content);
function addHTTP($matches) {
if(empty($matches[1])) {
return 'http://' . $matches[2] . '';
}
else {
return '' . $matches[2] . '';
}
}
Or two regular expressions (little harder to understand)...
$content = preg_replace('{\b(?:(?:https?|ftp)://)\S+[.]\S+\b}i',
'$0', $content);
$content = preg_replace('{\b(?<!["\'=><.])[-a-zA-Zа-яА-Яа-яА-Я()0-9#:%_+.~#?&;//=]+[.][-a-zA-Zа-яА-Яа-яА-Я()0-9#:%_+.~#?&;//=]+(?!["\'=><.])\b}i',
'http://$0', $content);
Also, you should avoid using target="". Users don't expect that new window will appear when clicking the link. After user will click such link he might wonder why "Go left" button doesn't work (hint: new window caused it to disappear). If somebody really wants to open link in new window he will do it yourself (it's not hard...).
Note that usually such stuff is linked with other helpers like this. For example, Stack Overflow uses some kind of Markdown modification which does more intelligent renaming, like changing plain text lists to HTML lists... But that all depends on what you need. If you only need processing links, you can try using those regexpes, but well...

How to stop BB Code manipulation (part two)?

I recently discovered an issue where people using BB Code to enter links are able to manipulate them.
They are meant to enter something like:
[LINK=http://www.domain.com]example text[/LINK]
However they can enter something like this to make the link color red:
[LINK=http://www.domain.com 'span style="color:red;"']example text[/LINK]
This is the code which converts it:
$text = preg_replace("/\[LINK\=(.*?)\](.*?)\[\/LINK\]/is", "<a href='$1' target='_blank'>$2</a>", $text);
Someone else was kind enough to provide a solution to a very similar problem but they want me to start a new question for this. Their solution just needs adapting. I have tried myself but I really can't get it to work. How to stop BB Code manipulation?
preg_replace_callback("/\\[LINK\=(.*?)\\\](.*?)\\[\/LINK\\]/is",
function (array $matches) {
if (filter_var($matches[1], FILTER_VALIDATE_URL))
return '<a href="'.
htmlspecialchars($matches[1], ENT_QUOTES).
'" target="_blank">'.
htmlspecialchars($matches[2])."</a>";
else
return "INVALID MARKUP";
}, $text);
Use a callback to validate the URL and don't forget htmlspecialchars.
I think the simplest and best solution might be to run the url through htmlspecialchars() to escape the weird characters. That way it wouldn't get put directly into the source, but would be escaped first, so it wouldn't be able to hack out of the href="...".
Instead of using a regex replace, use a regex match to extract the information that you want, in this case the link and link text.
Then write that information out in the right format. That should eliminate the opportunity to get weird data into the output.
You can even double-check the variables before using them to make sure they don't contain any HTML.

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
Could contain any number of any characters before the domain/word
Has the domain somewhere in the middle
Could contain any number of any characters after the domain
Ends with a number of extentions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the hashmap it returns. That allows you to filter for domains or even finer-grained control.
I think you can avoid the overhead of this in using the filter_var built-in function.
You may use this feature since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: You put $filterthese directly inside a single-quoted string. That single quotes don't allow for variable substitution. Also, the $filterthese is an array, that should first be joined:
var $filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but that points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*\.)?(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';

Categories