I'm quite new to CodeIgniter. I'm using the current_url() function to remember the URL of the previously viewed page. But the function (I think because of various Ajax calls) sometimes returns the URL of a file such as a JPG.
Like this:
/uploads/default/files/HTC.jpg
I'd like to skip these and keep only the URLs that appear in the browser's address bar.
Any idea? Thanks in advance!
I assume you are using current_url() and then saving the output of this function somewhere for later retrieval.
So you could, before you save the string, perform a regex check to see if it fits the format you want:
$pattern = '~.+\.[a-zA-Z]{0,3}$~';
$string = current_url();
preg_match($pattern, $string, $matches);
if (empty($matches)) {
// We can save the url
}
The regex matches URLs that end in a . followed by zero to three letters (so a four-letter extension such as .jpeg would slip through):
HTC.jpg will fail
la/di.php will fail
la/items/2 will pass
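As an alternative to the regex, here is a minimal sketch that inspects the file extension of the URL path instead. The helper name looks_like_page_url() is made up for illustration, and the rule "no extension means it is a page" is an assumption that matches the examples above (la/di.php is still rejected).

```php
<?php
// Hypothetical helper (not part of CodeIgniter): true when the URL path
// has no file extension, i.e. it is probably a page and not an asset.
function looks_like_page_url(string $url): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if ($path === false) {
        return false; // seriously malformed URL
    }
    if ($path === null) {
        return true;  // no path at all, e.g. "http://example.com"
    }
    return pathinfo($path, PATHINFO_EXTENSION) === '';
}

// Usage idea: only remember the URL when it looks like a page.
// if (looks_like_page_url(current_url())) { /* save it */ }
```

Unlike the fixed {0,3} quantifier, this also rejects longer extensions, at the cost of rejecting any page whose last path segment happens to contain a dot.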
I have a file, websites.txt, that contains unformatted text (it is HTML source code). I'd like to search it for URLs that start with example.com/sub/text and print them.
I am reading the file with file_get_contents and need to print only the URLs that match http://www.example.com/sub/text/
I tried preg_match, but I do not know how to build a pattern from http://www.example.com/sub/text/
Try this:
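// Literal dots are escaped; the character class stops at whitespace, quotes and angle brackets
$pattern = '%http://www\.example\.com/sub/text/[^\s"\'<>]*%';
if (preg_match_all($pattern, $content, $match)) {
    print_r($match[0]);
}
For PDF links only, something like this: $pattern = '%http://www\.example\.com/sub/text/[^\s"\'<>]*\.pdf%';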
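If the prefix may change, a small sketch like the following builds the pattern from it with preg_quote(), so that dots and other special characters are matched literally. The function name find_urls_with_prefix() and the sample HTML are assumptions for illustration; in the question the input would come from file_get_contents('websites.txt').

```php
<?php
// Hypothetical helper: collect every URL in $html that starts with $prefix.
function find_urls_with_prefix(string $html, string $prefix): array
{
    // preg_quote() escapes regex metacharacters in the prefix (and the ~ delimiter).
    $pattern = '~' . preg_quote($prefix, '~') . '[^\s"\'<>]*~';
    preg_match_all($pattern, $html, $matches);
    // Drop duplicates and reindex the result.
    return array_values(array_unique($matches[0]));
}

$html = '<a href="http://www.example.com/sub/text/a.html">a</a> '
      . '<a href="http://www.example.com/other/b.html">b</a>';
print_r(find_urls_with_prefix($html, 'http://www.example.com/sub/text/'));
```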
Here is another example to study; copy it and test it on your side:
$contentss = file_get_contents("http://www.ncbi.nlm.nih.gov/pubmed?LinkName=pubmed_pubmed&from_uid=18032633" );
preg_match('/<div class="rprt">(.*)<\/div>/',$contentss,$matches);
echo $matches[0];
I'm trying to parse a direct link out of a javascript function within a page. I'm able to parse the html info I need, but am stumped on the javascript part. Is this something that is achievable with php and possibly regex?
function videoPoster() {
document.getElementById("html5_vid").innerHTML =
"<video x-webkit-airplay='allow' id='html5_video' style='margin-top:"
+ style_padding
+ "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' "
+ "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}
What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!
#http://.*?\.mp4# will give you all characters between http:// and the first .mp4, inclusive. (With / as the delimiter, the slashes in http:// would have to be escaped.)
If you need the session id, use something like #http://.*?\.mp4\?sessionid=\d+# (note the escaped ?, which would otherwise make the 4 optional).
In general, no. Nothing short of a full javascript parser will always extract urls, and even then you'll have trouble with urls that are computed nontrivially.
In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:
['"](http://[^'"]*)['"]
If you have to enter that regexp as a string, beware of escaping.
If you ever have unescaped quotation marks in urls, this will fail. That's valid but rare. Whoever is writing the stuff you're parsing is unlikely to use them because they make referring to the urls in javascript a pain.
For your specific case, this should work, provided that none of the characters in the URL are escaped.
preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];
See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4?).
(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)
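The error handling suggested above could look roughly like this. It is only a sketch: the function name extract_video_url() is hypothetical, and the two sanity checks (starts with http://, contains .mp4) are the ones mentioned in the answer, not a complete validation.

```php
<?php
// Hypothetical helper: returns the first src='...' URL that looks like an
// MP4 link, or null when nothing suitable is found.
function extract_video_url(string $html): ?string
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    if (preg_match("/src='([^']*)'/", $html, $matches) !== 1) {
        return null;
    }
    $url = $matches[1];
    // Extra checks: the URL must begin with http:// and contain .mp4
    if (strpos($url, 'http://') !== 0 || strpos($url, '.mp4') === false) {
        return null;
    }
    return $url;
}

$js = "+ \"<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>\";";
var_dump(extract_video_url($js));
```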
The following captures every http or https URL that appears in a src attribute of your HTML:
$matches=array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)){
print_r($matches['urls']);
}
If you want to do the same in JavaScript, you could use this:
var re = /src=["'](https?:\/\/[^"']+)["']/g;
var urls = [];
var m;
// html.match() with the g flag would return whole matches, still including src=" and ",
// so loop with exec() to read the capture group of every match
while ((m = re.exec(html)) !== null) {
    urls.push(m[1]);
}
I need a PHP validation function for URLs with a query string (parameters separated by &). Currently I have the following function for validating URLs:
$pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/';
echo preg_match($pattern, $url);
This function correctly validates input like
google.com
www.google.com
http://google.com
http://www.google.com ...etc
But it fails to validate a URL that carries parameters (a query string), e.g.
http://google.com/index.html?prod=gmail&act=inbox
I need a function that accepts both types of URL inputs. Please help. Thanks in advance.
A simple filter_var
if(filter_var($yoururl, FILTER_VALIDATE_URL))
{
echo 'Ok';
}
might do the trick, although it has problems with URLs that lack a scheme:
http://codepad.org/1HAdufMG
You can work around the issue by placing http:// in front of URLs that lack it.
As suggested by @DaveRandom, you could do something like:
$parsed = parse_url($url);
if (!isset($parsed['scheme'])) $url = "http://$url";
before feeding the filter_var() function.
Overall, it's still a simpler solution than an overly complicated regex.
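Putting the two steps together, a minimal sketch might look like this. The function name is_valid_url() is made up here, and defaulting to http:// for scheme-less input is an assumption carried over from the workaround above.

```php
<?php
// Hypothetical helper: validates a URL, tolerating a missing scheme.
function is_valid_url(string $url): bool
{
    $parsed = parse_url($url);
    if ($parsed === false) {
        return false; // parse_url() gave up on the input
    }
    if (!isset($parsed['scheme'])) {
        $url = 'http://' . $url; // assumed default scheme
    }
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(is_valid_url('google.com'));
var_dump(is_valid_url('http://google.com/index.html?prod=gmail&act=inbox'));
var_dump(is_valid_url('not a url'));
```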
It also has these flags available:
FILTER_FLAG_PATH_REQUIRED (used with FILTER_VALIDATE_URL): requires the URL to contain a path part.
FILTER_FLAG_QUERY_REQUIRED (used with FILTER_VALIDATE_URL): requires the URL to contain a query string.
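The flags are passed as the third argument to filter_var(). The wrapper names below are hypothetical and exist only to keep the sketch short.

```php
<?php
// Hypothetical wrappers around filter_var() with the two flags.
function is_url_with_query(string $url): bool
{
    return filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_QUERY_REQUIRED) !== false;
}

function is_url_with_path(string $url): bool
{
    return filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED) !== false;
}

var_dump(is_url_with_query('http://google.com/index.html?prod=gmail&act=inbox'));
var_dump(is_url_with_query('http://google.com/index.html'));
var_dump(is_url_with_path('http://google.com/'));
```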
http://php.net/manual/en/function.parse-url.php
Some might argue this is not 100% bulletproof, but you can try it as a starting point.
How can you check whether text typed in by a user is a URL?
Let's say I want to check that the text is a URL and then check whether the string has "youtube.com" in it. Afterwards, I want to get the part of the link that interests me: the value between "watch?v=" and any following "&" parameters, if they exist.
parse_url() is probably a good choice here. If you get a bad URL, the function will return false. Otherwise, it will break the URL up into its component pieces and you can use the ones you need.
Example:
$urlParts = parse_url('http://www.youtube.com/watch?v=MX0D4oZwCsA');
if ($urlParts === false) echo "Bad URL";
else echo "Param string is ".$urlParts['query'];
Outputs:
Param string is v=MX0D4oZwCsA
You could split the query portion as needed using explode() for specific parameters.
Edit: Keep in mind that parse_url() tries as hard as possible to parse the string it is given, so bad URLs will often succeed, although the resulting data array will be very odd. It's obviously up to you how definitive you want your validation to be and what exactly you require out of your user input.
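For the YouTube case, here is a rough sketch that uses parse_str() rather than explode() to split the query string (a swapped-in technique, not the one named above). The function name youtube_video_id() is hypothetical, and the host test is the plain substring check the question describes, so a host such as notyoutube.com would also pass.

```php
<?php
// Hypothetical helper: returns the value of the v parameter for URLs whose
// host contains "youtube.com", or null otherwise.
function youtube_video_id(string $input): ?string
{
    $parts = parse_url($input);
    if ($parts === false || !isset($parts['host'], $parts['query'])) {
        return null; // not a URL with both a host and a query string
    }
    if (stripos($parts['host'], 'youtube.com') === false) {
        return null;
    }
    // parse_str() decodes the query string into an associative array.
    parse_str($parts['query'], $params);
    if (!isset($params['v']) || !is_string($params['v']) || $params['v'] === '') {
        return null;
    }
    return $params['v'];
}

var_dump(youtube_video_id('http://www.youtube.com/watch?v=MX0D4oZwCsA&feature=related'));
```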
preg_match('#watch\?v=([^&]+)#', $url, $matches);
echo $matches[1];
It's strongly recommended not to use parse_url() for URL validation.
Here is a nice solution.
I am trying to get the page or last directory name from a URL.
For example, if the URL is http://www.example.com/dir/, I want it to return dir; if the URL is http://www.example.com/page.php, I want it to return page. Note that I do not want the trailing slash or the file extension.
I tried this:
$regex = "/.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*/i";
$name = strtolower(preg_replace($regex,"$2",$url));
I ran this regex in PHP and it returned nothing. (however I tested the same regex in ActionScript and it worked!)
So what am I doing wrong here, how do I get what I want?
Thanks!!!
Don't use / as the regex delimiter if it also contains slashes. Try this:
$regex = "#^.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*$#i";
You could also escape the "/" in the middle; left unescaped, it simply closes your regex early. So this may work:
$regex = "/.*\.(com|gov|org|net|mil|edu)\/([a-z_\-]+).*/i";
You may also make the regex somewhat more general, but that's another problem.
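To see the delimiter fix at work, here is a small sketch that wraps the # version in a function. The name first_path_segment() is made up, and preg_match() is used instead of preg_replace() so that a non-matching URL yields null rather than the unchanged input.

```php
<?php
// Hypothetical helper: first path segment after one of the listed TLDs,
// lower-cased, or null when the URL does not fit the pattern.
function first_path_segment(string $url): ?string
{
    // # is the delimiter, so the / inside the pattern needs no escaping.
    $regex = '#^.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*$#i';
    if (preg_match($regex, $url, $m) !== 1) {
        return null;
    }
    return strtolower($m[2]);
}

var_dump(first_path_segment('http://www.example.com/page.php'));
var_dump(first_path_segment('http://www.example.com/dir/'));
```

Note that this returns the first segment after the domain, which equals the last one only for single-level paths like those in the question.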
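You can use this:
$parts = explode('/', rtrim($url, '/')); // drop a trailing slash first
$last = array_pop($parts); // array_pop() takes its argument by reference, so it needs a variable
Then apply a simple regex to remove any file extension.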
Assuming you want to match the entire address after the domain portion:
$regex = "%://[^/]+/([^?#]+)%i";
The above assumes a URL of the format scheme://domainpart/everythingelse.
Then again, it seems that the problem here isn't that your RegEx isn't powerful enough, just mistyped (closing delimiter in the middle of the string). I'll leave this up for posterity, but I strongly recommend you check out PHP's parse_url() method.
This should adequately deliver:
substr($s = basename($_SERVER['REQUEST_URI']), 0, strrpos($s,'.') ?: strlen($s))
But this is better:
preg_replace('/[#\.\?].*/','',basename($path));
Your example is short, though, so I cannot tell whether you want to preserve the entire path or just the last element of it. The preceding example preserves only the last piece, but this should keep the whole path while being generic enough to work with just about anything that can be thrown at you:
preg_replace('~(?:/$|[#\.\?].*)~','',substr(parse_url($path, PHP_URL_PATH),1));
As much as I personally love using regular expressions, more 'crude' (for want of a better word) string functions might be a good alternative for you. The snippet below uses sscanf to parse the path part of the URL for the first bunch of letters.
$url = "http://www.example.com/page.php";
$path = parse_url($url, PHP_URL_PATH);
sscanf($path, '/%[a-z]', $part);
// $part = "page";
This expression:
(?<=^[^:]+://[^.]+(?:\.[^.]+)*/)[^/]*(?=\.[^.]+$|/$)
Gives the following results:
http://www.example.com/dir/ dir
http://www.example.com/foo/dir/ dir
http://www.example.com/page.php page
http://www.example.com/foo/page.php page
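Apologies in advance if this is not valid PHP regex; I tested it using RegexBuddy. Note that PCRE, which PHP's preg_* functions use, rejects an unbounded variable-length lookbehind such as the one above.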
Save yourself the regular expression and make PHP's other functions feel more loved.
$url = "http://www.example.com/page.php";
$filename = pathinfo(parse_url($url, PHP_URL_PATH), PATHINFO_FILENAME);
Warning: for PHP 5.2 and up.
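Wrapped in a function, the same idea covers both the directory and the page case, because pathinfo() ignores a trailing slash. The name last_url_segment() is hypothetical, and returning an empty string for URLs without a path is an assumption of this sketch.

```php
<?php
// Hypothetical helper: last path segment of a URL without its extension.
function last_url_segment(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return ''; // malformed URL or no path component
    }
    return pathinfo($path, PATHINFO_FILENAME);
}

var_dump(last_url_segment('http://www.example.com/dir/'));
var_dump(last_url_segment('http://www.example.com/foo/page.php'));
```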