How do you validate Wikipedia URLs like these in PHP?

URL: https://en.m.wikipedia.org/wiki/Professional_Tax
This URL is not validated by the following regex:
function isValidURL($url) {
    return preg_match('|^(http(s)?://)?[a-z0-9-]+\.(.[a-z0-9-]+)+(:[0-9]+)?(/.*)?$|i', $url);
}
The purpose is this: we have a large number of URLs embedded in forum posts, and we want a script that keeps track of which URLs are still good. For this we need to extract the URLs from the posts and store them in a database, which can then be checked at intervals for their status codes.

To match this URL you can use:
^https?\:\/\/([\w\.]+)wikipedia\.org\/wiki\/([\w]+\_?)+
This only matches the URL. Checking which URLs are still good (if I understand correctly, that means still active) is not a job for a regex.
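Since a pattern like the one above only checks the shape of the URL, one alternative is to let PHP's own URL functions do the parsing. The following is a minimal sketch, assuming you only want http(s) links under /wiki/ on wikipedia.org or one of its subdomains. isWikipediaArticleUrl is a hypothetical helper name, not something from the question, and it does not check whether the page is still reachable.

```php
<?php
// Hypothetical helper: accepts only http(s) URLs on wikipedia.org
// (or a subdomain such as en.m.wikipedia.org) whose path starts with /wiki/.
function isWikipediaArticleUrl(string $url): bool
{
    // Reject anything PHP does not consider a syntactically valid URL.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $parts  = parse_url($url);
    $scheme = strtolower($parts['scheme'] ?? '');
    $host   = strtolower($parts['host'] ?? '');
    $path   = $parts['path'] ?? '';

    if ($scheme !== 'http' && $scheme !== 'https') {
        return false;
    }

    // The host must be wikipedia.org itself or end in ".wikipedia.org".
    $suffix = '.wikipedia.org';
    if ($host !== 'wikipedia.org' && substr($host, -strlen($suffix)) !== $suffix) {
        return false;
    }

    // Require a non-empty article name after /wiki/.
    return strncmp($path, '/wiki/', 6) === 0 && strlen($path) > 6;
}
```

Note that FILTER_VALIDATE_URL only accepts ASCII URLs, so article titles with non-ASCII characters would have to be percent-encoded first.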

Here is an alternative regex for most URLs:
(?<![#\w])(((http|https)(:\/\/))?([\w\-_]{2,})(([\.])([\w\-_]*)){1,})([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])
You can experiment with regex here: https://regex101.com/

Related

Hide the Youtube URL in post content on WordPress while using get_the_content()

On my website, I'm using get_the_content() to display the contents of my posts. But the post's video URL, a YouTube URL in this case, is also displayed along with the content.
I tried using preg_replace() to remove the string that contains the YouTube URL. But I want to do it dynamically, matching on the beginning of the string
'https://www.youtube.com/watch?v' and replacing the whole URL with an empty string.
I'm looking for something like this:
$description = preg_replace('https://www.youtube.com/watch?v=f0cReHKUiJM', '', get_the_content());
Please share your ideas on how I can accomplish this.
Thanks in advance.
If you literally want to remove all youtube urls from your get_the_content() string, (assuming the href value is double quoted) you can use this:
~https?:/{2}(?:w{3}\.)?youtube\.com[^"]+~
This can be implemented as:
$description=preg_replace('~https?:/{2}(?:w{3}\.)?youtube\.com[^"]+~','',get_the_content());
But this will leave the "shell" of the link (...).
You could go one step further and remove tags and retain the text within:
~<a.*?href="https?://(?:www\.)?youtube\.com[^>]*>(.*?)</a>~
Implementation:
$description = preg_replace('~<a.*?href="https?://(?:www\.)?youtube\.com[^>]*>(.*?)</a>~', '\1', get_the_content());
The truth is, you haven't provided any full string samples from get_the_content() or explained if there were variations in the url or a tag structure (extra attributes or manner of quoting). These factors will likely impact the pattern design. My answer is mostly based on assumptions. Maybe it will help you. If it doesn't, it would be best if you updated your question.
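To make those assumptions concrete, here is a rough sketch that combines both ideas: it first unwraps anchors whose double-quoted href points at youtube.com, then removes bare watch URLs. removeYoutubeUrls is a hypothetical helper name, and the patterns assume www or no subdomain and double-quoted href values, which may not match your actual content.

```php
<?php
// Hypothetical helper; in WordPress you would pass in get_the_content().
function removeYoutubeUrls(string $content): string
{
    // 1) Unwrap <a> tags whose double-quoted href points at youtube.com,
    //    keeping only the link text. Fall back to the input if PCRE fails.
    $content = preg_replace(
        '~<a\b[^>]*\bhref="https?://(?:www\.)?youtube\.com[^"]*"[^>]*>(.*?)</a>~is',
        '$1',
        $content
    ) ?? $content;

    // 2) Remove bare watch URLs, including any trailing query parameters.
    return preg_replace(
        '~https?://(?:www\.)?youtube\.com/watch\?v=[\w-]+[^\s"<]*~i',
        '',
        $content
    ) ?? $content;
}
```

A call such as $description = removeYoutubeUrls(get_the_content()); would then replace the hard-coded video ID from the question. Short links on youtu.be are not covered by this sketch.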

PHP preg_match , check if language is defined in url

I would like to test for a language match in a URL.
The URL will look like: http://www.domainname.com/en/#m=4&guid=%some_param%
I want to check whether there is an existing language code within the URL. I was thinking of something along these lines:
^(.*:)\/\/([a-z\-.]+)(:[0-9]+)?(.*)$
or
^(http|https:)\/\/([a-z\-.]+)(:[0-9]+)?(.*)$
I'm not that sharp with regex. Can anyone help or point me in the right direction?
Try this:
^https?://[a-z0-9.-]+/([a-z]{2})/
http://www.regexr.com/ is an easy site for building and testing regexes.
If you know the data you are testing is a url then I would not bother adding all of the url parts to the regex. Keep it simple like: /\/[a-z]{2}\// That looks for a two letter combination between two forward slashes. If you need to capture the language code then wrap it in parentheses: /\/([a-z]{2})\//
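As a small variation on the simple pattern above, the sketch below runs the check against the URL path only, so that a two-letter host name cannot be mistaken for a language code. It assumes the language code is always the first path segment; extractLanguageCode is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns the two-letter language code found in the
// first path segment of the URL, or null if there is none.
function extractLanguageCode(string $url): ?string
{
    // Look only at the path, so the scheme, host and fragment are ignored.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }

    // Exactly two letters, followed by a slash or the end of the path.
    if (preg_match('~^/([a-z]{2})(?:/|$)~i', $path, $m) !== 1) {
        return null;
    }

    return strtolower($m[1]);
}
```

If you also keep a list of the languages your site supports, checking the result against that list with in_array() would filter out two-letter segments that are not languages.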

Parsing link from javascript function

I'm trying to parse a direct link out of a javascript function within a page. I'm able to parse the html info I need, but am stumped on the javascript part. Is this something that is achievable with php and possibly regex?
function videoPoster() {
    document.getElementById("html5_vid").innerHTML =
        "<video x-webkit-airplay='allow' id='html5_video' style='margin-top:"
        + style_padding
        + "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' "
        + "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}
What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!
/http:\/\/.*\.mp4/ will give you all characters between http:// and .mp4, inclusive.
If you need the session id, use something like /http:\/\/.*\.mp4\?sessionid=\d+/
In general, no. Nothing short of a full javascript parser will always extract urls, and even then you'll have trouble with urls that are computed nontrivially.
In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:
['"](http://[^'"]*)['"]
If you have to enter that regexp as a string, beware of escaping.
If you ever have unescaped quotation marks in urls, this will fail. That's valid but rare. Whoever is writing the stuff you're parsing is unlikely to use them because they make referring to the urls in javascript a pain.
For your specific case, this should work, provided that none of the characters in the URL are escaped.
preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];
See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4?).
(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)
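The error handling and extra checks suggested above could look roughly like this. It is only a sketch: extractVideoUrl is a hypothetical helper name, and it assumes the link sits in a quoted src attribute, starts with http:// or https://, and ends in .mp4 with an optional query string.

```php
<?php
// Hypothetical helper: returns the first .mp4 URL found in a quoted src
// attribute, or null when nothing matches.
function extractVideoUrl(string $html): ?string
{
    $pattern = '~src=["\'](https?://[^"\']+\.mp4(?:\?[^"\']*)?)["\']~i';

    // preg_match() returns 1 on a match, 0 on no match and false on error,
    // so anything other than 1 is treated as "not found".
    if (preg_match($pattern, $html, $matches) !== 1) {
        return null;
    }

    return $matches[1];
}
```

The caller can then test for null instead of reading $matches[1] blindly, which avoids an undefined index notice when the page layout changes.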
The following captures every http(s) URL found in a src attribute in your HTML:
$matches = array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)) {
    print_r($matches['urls']);
}
If you want to do the same in JavaScript you could use this:
var matches;
if (matches = html.match(/src=["'](https?:\/\/[^"']+)["']/g)) {
    // gives you all matches, but they still include the src=" and " parts, so you would
    // have to run every match again against the regex without the g modifier
}

Check for multiple patterns with preg_replace?

I am looking for a way to validate submitted posts. I want to check whether someone submits (within the post):
An iframe embedding a YouTube or Vimeo video, in which case the width used in the iframe should be replaced with the correct one
A URL, to be replaced by a clickable HTML link
An image URL, to be replaced by an HTML image tag
I was able to find the correct regexes for each of these requirements, but using three separate preg_replace calls causes interference. For example, detecting a URL will also detect the URL inside the iframe.
I have searched for a solution, both on Stack Overflow and on the rest of the internet. But I am not an expert, so perhaps someone could help me out or direct me to the right tutorial/website/how-to...
What you can do is first match the iframes with preg_match, and then replace them with a placeholder.
Then you can do the replacements for urls/images. Then, replace the iframe placeholders back with the iframes you matched earlier.
You can generate unique sequential placeholders by using preg_replace_callback, so that you get to run some code to increment a $placeholder_id for each replacement.
This is a general strategy that can often greatly simplify complex parsing.
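The placeholder strategy described above can be sketched as follows. The function name linkifyOutsideIframes, the [[IFRAME_n]] placeholder format, and the simple URL pattern are all assumptions made for illustration; a real post could in theory contain the placeholder text itself, so pick something that cannot occur in your data.

```php
<?php
// Hypothetical helper: turns bare URLs into links while leaving iframes untouched.
function linkifyOutsideIframes(string $post): string
{
    $iframes = [];
    $placeholderId = 0;

    // Step 1: swap every iframe for a unique sequential placeholder.
    $post = preg_replace_callback(
        '~<iframe\b.*?</iframe>~is',
        function (array $m) use (&$iframes, &$placeholderId): string {
            $key = '[[IFRAME_' . $placeholderId++ . ']]';
            $iframes[$key] = $m[0];
            return $key;
        },
        $post
    ) ?? $post;

    // Step 2: replace the remaining URLs; the iframe URLs are out of reach now.
    $post = preg_replace(
        '~https?://[^\s"<>\[\]]+~i',
        '<a href="$0">$0</a>',
        $post
    ) ?? $post;

    // Step 3: put the original iframes back in place of their placeholders.
    return $iframes ? strtr($post, $iframes) : $post;
}
```

The image replacement and the iframe width fix would slot in the same way: image URLs in step 2 before the generic link replacement, and the width adjustment on $m[0] inside the callback.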
You can simply pass an array of URL patterns to preg_replace() like this:
$pattern_array = array(
    '/somepattern/',
    '/someotherpattern/',
    '/yetanotherpattern/',
);
$replacement_array = array(
    'somereplacement',
    'someotherreplacement',
    'yetanotherreplacement'
);
$result = preg_replace($pattern_array, $replacement_array, $subject_string);

Extract URL from string

I'm trying to find a reliable solution to extract a url from a string of characters. I have a site where users answer questions and in the source box, where they enter their source of information, I allow them to enter a url. I want to extract that url and make it a hyperlink. Similar to how Yahoo Answers does it.
Does anyone know a reliable solution that can do this?
All the solutions I have found work for some URLs but not for others.
Thanks
John Gruber has spent a fair amount of time perfecting the "one regex to rule them all" for link detection. Using preg_replace() as mentioned in the other answers, using the following regex should be one of the most accurate, if not the most accurate, method for detecting a link:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
If you only wanted to match HTTP/HTTPS:
(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
$string = preg_replace('/https?:\/\/[^\s"<>]+/', '<a href="$0">$0</a>', $string);
It only matches http/https, but that's really the only protocol you want to turn into a link. If you want others, you can change it like this:
$string = preg_replace('/(https?|ssh|ftp):\/\/[^\s"]+/', '<a href="$0">$0</a>', $string);
There are a lot of edge cases with URLs. For example, a URL may contain brackets or omit the protocol. That's why a regex alone is not enough.
I created a PHP library that deals with many of these edge cases: Url highlight.
You can extract URLs from a string or highlight them directly.
Example:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
// Extract urls
$urlHighlight->getUrls("This is example http://example.com.");
// return: ['http://example.com']
// Make urls as hyperlinks
$urlHighlight->highlightUrls('Hello, http://example.com.');
// return: 'Hello, <a href="http://example.com">http://example.com</a>.'
For more details see the readme. For the covered URL cases see the tests.
Yahoo! Answers does a fairly good job of link identification when the link is written properly and separate from other text, but it isn't very good at separating trailing punctuation. For example, the sentence "The links are http://example.com/somepage.php, http://example.com/somepage2.php, and http://example.com/somepage3.php." will include commas on the first two and a period on the third.
But if that is acceptable, then patterns like this should do it:
\<http:[^ ]+\>
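If the trailing punctuation is not acceptable, one possible sketch in PHP is to match greedily and then trim common sentence punctuation from the end of each match. extractUrls is a hypothetical helper name, and the set of trimmed characters is an assumption; a URL that genuinely ends in one of them would be cut short.

```php
<?php
// Hypothetical helper: returns all http(s) URLs in the text, with trailing
// sentence punctuation removed from each one.
function extractUrls(string $text): array
{
    // Grab everything up to the next whitespace, quote or angle bracket.
    if (!preg_match_all('~https?://[^\s"<>]+~i', $text, $matches)) {
        return [];
    }

    // Strip punctuation that most likely belongs to the sentence, not the URL.
    return array_map(
        function (string $url): string {
            return rtrim($url, '.,;:!?');
        },
        $matches[0]
    );
}
```

Applied to the example sentence with the three somepage links, this returns the three URLs without the commas or the final period.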
It looks like Stack Overflow's parser is better. Is it open source?
This code worked for me.
function makeLink($string){
    /*** make sure there is an http:// on all URLs ***/
    $string = preg_replace("/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i", "$1http://$2", $string);
    /*** make all URLs links ***/
    $string = preg_replace("/([\w]+:\/\/[\w\-?&;#~=\.\/\@]+[\w\/])/i", "<a target=\"_blank\" href=\"$1\">$1</a>", $string);
    /*** make all emails hot links ***/
    $string = preg_replace("/([\w\-?&;#~=\.\/]+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i", "<a href=\"mailto:$1\">$1</a>", $string);
    return $string;
}
