Grab URL within a string which contains HTML code


I have a string, for example:
$html = '<p>helloworld</p><p>helloworld</p>';
And I want to search the string for the first URL that starts with youtube.com or youtu.be and store it in variable $first_found_youtube_url.
How can I do this efficiently?
I can do a preg_match or strpos looking for the urls but not sure which approach is more appropriate.

I wrote this function a while back; it uses a regex and returns an array of unique URLs in the order they appear in the string. Since you want the first one, you can just use the first item in the array. Note that it matches any http(s) URL, not only YouTube ones.
function getUrlsFromString($string) {
    $regex = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#i';
    preg_match_all($regex, $string, $matches);
    // array_unique() keeps the first occurrence of each URL; array_values() reindexes from 0
    return array_values(array_unique($matches[0]));
}
Example:
$html = '<p>helloworld</p><p>helloworld</p>';
$urls = getUrlsFromString($html);
$first_found_url = $urls[0] ?? null; // null when no URL was found
With a YouTube-specific regex (dots escaped, and "_" and "-" allowed in the video ID):
function getYoutubeUrlsFromString($string) {
    $regex = '#https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)[\w-]+#i';
    preg_match_all($regex, $string, $matches);
    // Unique URLs, still in the order they were found
    return array_values(array_unique($matches[0]));
}
Example:
$html = '<p>helloworld</p><p>helloworld</p>';
$urls = getYoutubeUrlsFromString($html);
$first_found_youtube = $urls[0] ?? null; // null when no YouTube URL was found
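If only the first match matters, preg_match is enough, since it stops at the first hit instead of collecting every URL. A minimal sketch, assuming the same two URL shapes as above; the function name and the sample string are made up for illustration:

```php
<?php
// Hypothetical helper: first YouTube URL in a string, or null if there is none.
function firstYoutubeUrlByRegex(string $text): ?string
{
    $regex = '#https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)[\w-]+#i';
    // preg_match() returns 1 on a match, 0 on no match, false on error
    return preg_match($regex, $text, $m) === 1 ? $m[0] : null;
}

// Sample input (an assumption, not the asker's real data)
$sample = '<p>https://example.com/a</p>'
        . '<p>https://youtu.be/dQw4w9WgXcQ and https://www.youtube.com/watch?v=abc_123</p>';

echo firstYoutubeUrlByRegex($sample); // https://youtu.be/dQw4w9WgXcQ
```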

You can parse the HTML with DOMDocument and look for YouTube URLs with stripos, something like this:
$html = '<p>helloworld</p><p>helloworld</p>';
$DOMD = new DOMDocument();
@$DOMD->loadHTML($html); // @ suppresses warnings about malformed markup
foreach ($DOMD->getElementsByTagName("a") as $url) {
    $href = $url->getAttribute("href");
    if (0 === stripos($href, "https://www.youtube.com/") || 0 === stripos($href, "https://youtu.be/")) {
        $first_found_youtube_url = $href;
        break;
    }
}
Personally, I would probably compare the host instead:
"youtube.com" === parse_url($url->getAttribute("href"), PHP_URL_HOST)
as that matches both http and https links, which is probably what you want, though strictly speaking it is not what the question asks for right now. Keep in mind that the host has to match exactly, so www.youtube.com and youtu.be need their own comparisons.
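The host comparison can be sketched roughly like this. The function name, the list of accepted hosts and the sample markup are assumptions for illustration, so adjust them to whatever you actually need to accept:

```php
<?php
// Hypothetical helper: href of the first <a> whose host is a YouTube host, or null.
function firstYoutubeUrl(string $html): ?string
{
    // Assumed list of hosts; extend as needed
    $hosts = ['youtube.com', 'www.youtube.com', 'm.youtube.com', 'youtu.be'];

    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST); // null or false when there is no usable host
        if (is_string($host) && in_array(strtolower($host), $hosts, true)) {
            return $href;
        }
    }
    return null;
}

// Sample markup (made up for this example)
$sampleHtml = '<p><a href="https://example.com/">x</a></p>'
            . '<p><a href="http://youtu.be/abc123">y</a></p>'
            . '<p><a href="https://www.youtube.com/watch?v=zzz">z</a></p>';

echo firstYoutubeUrl($sampleHtml); // http://youtu.be/abc123
```

Because only the host is compared, the scheme (http or https) and the rest of the path do not matter.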

I think this will do what you are looking for. I have used preg_match_all simply because I find it easier to debug the regexes.
<?php
$html = '<p>helloworld</p><p>helloworld</p>';
// Matches youtu.be and youtube.com, with or without www.
$pattern = '/https?:\/\/(www\.)?youtu(\.be|be\.com)\/[a-zA-Z0-9\?=_-]*/i';
preg_match_all($pattern, $html, $matches);
// print_r($matches);
$first_found_youtube = $matches[0][0] ?? null; // null when nothing matched
echo $first_found_youtube;
demo - https://3v4l.org/lFjmK

Related

Find URL in string and turn into a link

I'm using the code given on this page to look through a string and turn the URL into an HTML link.
It works quite well, but there is a little issue with the "replace" part of it.
The problem occurs when I have almost identical links. For example:
https://example.com/page.php?goto=200
and
https://example.com/page.php
Everything will be fine with the first link, but the second will create a <a> tag in the first <a> tag.
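First run:
<a href="https://example.com/page.php?goto=200">https://example.com/page.php?goto=200</a>
Second run:
<a href="<a href="https://example.com/page.php">https://example.com/page.php</a>?goto=200"><a href="https://example.com/page.php">https://example.com/page.php</a>?goto=200</a>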
Because it's also replacing the html link just created.
How do I avoid this?
<?php
function turnUrlIntoHyperlink($string){
//The Regular Expression filter
$reg_exUrl = "/(?i)\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))/";
// Check if there is a url in the text
if(preg_match_all($reg_exUrl, $string, $url)) {
// Loop through all matches
foreach($url[0] as $newLinks){
if(strstr( $newLinks, ":" ) === false){
$link = 'http://'.$newLinks;
}else{
$link = $newLinks;
}
// Create Search and Replace strings
$search = $newLinks;
$replace = '<a href="'.$link.'">'.$link.'</a>';
$string = str_replace($search, $replace, $string);
}
}
//Return result
return $string;
}
?>

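You need to require whitespace before the URL by replacing the leading \b in your regex with \s. A \b word boundary also matches right after the quote in href=", so URLs inside links that were already created get matched again; \s does not match there. Note that a URL at the very start of the string will then no longer match.
Your regex can be written as: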
$reg_exUrl = "/(?i)\s((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))/"
check this one: https://regex101.com/r/YFQPlZ/1

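I have changed the replace part a bit, since I couldn't get the suggested regex to work. Each match is first swapped for a numbered placeholder, and the placeholders are turned into links afterwards, so a link that was already created is never searched again.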
Maybe it can be done better, but I'm still learning :)
function turnUrlIntoHyperlink($string){
//The Regular Expression filter
$reg_exUrl = "/(?i)\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))/";
// Check if there is a url in the text
if(preg_match_all($reg_exUrl, $string, $url)) {
// Loop through all matches
$replacements = array();
foreach($url[0] as $key => $newLinks){
if(strstr( $newLinks, ":" ) === false){
$link = 'https://'.$newLinks;
}else{
$link = $newLinks;
}
// Remember the link that belongs to this placeholder
$replacements['{'.$key.'}'] = '<a href="'.$link.'">'.$link.'</a>';
// Swap the first remaining occurrence of the URL for its placeholder
$newLinks = '/'.preg_quote($newLinks, '/').'/';
$string = preg_replace($newLinks, '{'.$key.'}', $string, 1);
}
// Turn every placeholder into its link in one pass
$string = strtr($string, $replacements);
}
//Return result
return $string;
}
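Another way to avoid touching a link twice is to do the whole thing in one pass with preg_replace_callback, so each match is replaced at its own position and the result is never searched again. This is only a sketch: the function name is made up, and the pattern is a deliberately simplified stand-in for the long regex above (it only handles http/https URLs and may swallow trailing punctuation).

```php
<?php
// Hypothetical helper: wrap every http(s) URL in plain text in an <a> tag, in one pass.
function linkifyUrls(string $text): string
{
    // Simplified pattern (an assumption): a URL ends at whitespace, a quote or an angle bracket
    $pattern = '~\bhttps?://[^\s<>"]+~i';

    return preg_replace_callback($pattern, function (array $m): string {
        // Escape the URL before putting it into HTML
        $url = htmlspecialchars($m[0], ENT_QUOTES);
        return '<a href="' . $url . '">' . $url . '</a>';
    }, $text);
}

echo linkifyUrls('See https://example.com/page.php?goto=200 and https://example.com/page.php');
// See <a href="https://example.com/page.php?goto=200">https://example.com/page.php?goto=200</a> and <a href="https://example.com/page.php">https://example.com/page.php</a>
```

It assumes the input is plain text. Running it on text that already contains links would still match the URLs inside them.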

I'm building a php web scraper and preg_match gives an error

I'm building an ebay web scraper and the preg_match for price throws an error.
preg_match('/<title>([^<]+)<\/title>/i', $data, $matches);
$title = $matches[1];
preg_match('/<img id="icImg"[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches);
$img = $matches[1];
preg_match('/<span id\=\"prcIsum\"\>([^<]+)<\/span>/i', $data, $matches);
$price = $matches[1];
The title and the img are being scraped okay but I get this PHP error on price span element: PHP Notice: Undefined offset: 1

The error is fairly self explanatory; $matches[1] doesn't exist. This is probably because there is no match in the $data string.
preg_match() returns 1 if the pattern matches given subject, 0 if it
does not, or FALSE if an error occurred.
$isMatch = preg_match('/<title>([^<]+)<\/title>/i', $data, $matches);
if($isMatch == 1){
$title = $matches[1];
}
$isMatch = preg_match('/<img id="icImg"[^>]*src=[\'"]([^\'"]+)[\'"][^>]*>/i', $data, $matches);
if($isMatch == 1){
$img = $matches[1];
}
$isMatch = preg_match('/<span id\=\"prcIsum\"\>([^<]+)<\/span>/i', $data, $matches);
if($isMatch == 1){
$price = $matches[1];
}
Perhaps you should make sure the regex is valid for the $data you are using and that it does in fact return matches.

It is not preg_match() that throws the error; the notice appears when you read $matches[1], because the function returned no matches. So you need to check your regex. Also, there is little sense in using preg_match for HTML parsing when you can use a DOM parser. A better solution would be:
$doc = new DOMDocument();
@$doc->loadHTML($data); // $data is the HTML string; @ hides warnings about malformed markup
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//span[@id='prcIsum']");
if ($elements !== false) {
    foreach ($elements as $element) {
        echo $element->nodeName;
        echo $element->nodeValue;
    }
}
The other option is to use the getElementById() method.
This is a modified example from the php.net site.
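To avoid the "Undefined offset" style of problem with the DOM approach as well, the lookup can be wrapped so that a missing element gives null instead of a notice. A rough sketch: the helper name and the sample HTML are invented for the example, and only the prcIsum id comes from the question.

```php
<?php
// Hypothetical helper: trimmed text of the first node matching an XPath query, or null.
function firstText(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    // query() returns false for an invalid expression and an empty list for no match
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Sample page (an assumption standing in for the scraped eBay HTML)
$data = '<html><head><title>Item</title></head>'
      . '<body><span id="prcIsum">US $12.50</span></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep libxml warnings out of the output
$doc->loadHTML($data);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$price   = firstText($xpath, "//span[@id='prcIsum']");      // 'US $12.50'
$missing = firstText($xpath, "//span[@id='doesNotExist']"); // null
```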

PHP Regex get reverse number

I have this:
$pattern = 'dev/25{LASTNUMBER}/P/{YYYY}';
$var = 'dev/251/P/2014';
In this situation {LASTNUMBER} = 1. How do I get this from $var?
The pattern can contain more variables, always in {}.
The pattern can also be different, for example:
$pattern = '{LASTNUMBER}/aa/bb/P/{OtherVar}';
In this situation $var will be 1/aa/bb/p/some and I want to get 1.
I need to get {LASTNUMBER}, given the pattern and the resulting string.
OK, maybe it is not possible :) or very, very hard.

Use a regex:
if (preg_match('~dev/25([0-9])/P/[0-9]{4}~', $var, $m)) {
$lastnum = $m[1];
}

$parts = explode("/", $var);
if (isset($parts[1])) {
    $lastnum = substr($parts[1], -1); // last character of "251"
}
This will be faster than a regex :)

You probably need this:
<?php
$pattern = 'dev/251/P/2014';
preg_match_all('%dev/25(.*?)/P/[\d]{4}%sim', $pattern, $match, PREG_PATTERN_ORDER);
$match = $match[1][0];
echo $match; // echoes 1
?>
If you need to loop through the results, you can use:
<?php
$pattern = <<< EOF
dev/251/P/2014
dev/252/P/2014
dev/253/P/2014
dev/254/P/2014
dev/255/P/2014
EOF;
preg_match_all('%dev/25(.*?)/P/[\d]{4}%sim', $pattern , $match, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($match[1]); $i++) {
echo $match[1][$i]; // echoes 12345 over the five iterations
}
?>
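Since the pattern itself can change, a more general option is to build the regex from the pattern: every {NAME} becomes a named capture group and everything else is quoted literally. A minimal sketch, assuming a placeholder stands for one or more characters other than "/" and that placeholder names are valid group names (letters, digits and underscores, not starting with a digit). The function name is made up.

```php
<?php
// Hypothetical helper: match $value against a template with {NAME} placeholders.
// Returns an array of placeholder values keyed by name, or null when it does not match.
function extractPlaceholders(string $template, string $value): ?array
{
    // Split into literal text and {NAME} tokens, keeping the tokens
    $parts = preg_split('/(\{\w+\})/', $template, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $regex = '';
    foreach ($parts as $part) {
        if (preg_match('/^\{(\w+)\}$/', $part, $m)) {
            // Assumption: a placeholder is one or more characters other than "/"
            $regex .= '(?<' . $m[1] . '>[^/]+)';
        } else {
            $regex .= preg_quote($part, '~');
        }
    }

    if (preg_match('~^' . $regex . '$~', $value, $matches) !== 1) {
        return null;
    }
    // Keep only the named groups, drop the numeric duplicates
    return array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
}

$found = extractPlaceholders('dev/25{LASTNUMBER}/P/{YYYY}', 'dev/251/P/2014');
echo $found['LASTNUMBER']; // 1
```

The same call works for the second pattern, e.g. extractPlaceholders('{LASTNUMBER}/aa/bb/P/{OtherVar}', '1/aa/bb/P/some').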

change content outside and inside brackets with preg_replace

I need to change this line: count(id)
to a line like this: count(tableName.id)
I tried to do this with preg_match and preg_replace like this:
$a = "count(id)";
$regex = "/\w{3,}+\W/";
$dd = preg_match("/\(.*?\)/", $a, $matches);
$group = $matches[0];
if (preg_match($regex, $a)) {
$c = preg_replace("$group", "(table.`$group`)", $a);
var_dump($c);
}
The output I got is count((table.(id))), with extra brackets. I know what the problem is, but I can't find a solution because my regex knowledge is not so good.

$a = "count(id)";
$regex = "/\w{3,}+\W/";
$dd = preg_match("/\((.*?)\)/", $a, $matches);
$group = $matches[1]; // <-- you'll get error if the above regex doesn't match!
if (preg_match($regex, $a)) {
$c = preg_replace("/$group/", "table.$group", $a);
}
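The same result is possible with a single preg_replace and two capture groups, without the separate preg_match step. A small sketch, assuming the input always has the shape name(column) and that tableName is the literal prefix you want:

```php
<?php
$a = 'count(id)';
// $1 is the function name, $2 is the column name inside the parentheses
$c = preg_replace('/(\w+)\((\w+)\)/', '$1(tableName.$2)', $a);
echo $c; // count(tableName.id)
```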

php regular expression to match string if NOT in an HTML tag

I'm trying to solve this bug in Drupal's Hashtags module: http://drupal.org/node/1718154
I've got this function that matches every word in my text that is prefixed by "#", like #tag:
function hashtags_get_tags($text) {
$tags_list = array();
$pattern = "/#[0-9A-Za-z_]+/";
preg_match_all($pattern, $text, $tags_list);
$result = implode(',', $tags_list[0]);
return $result;
}
I need to ignore internal links in pages, such as link, or, more in general, any word prefixed by # that appears inside an HTML tag (so preceded by < and followed by >).
Any idea how can I achieve this?

Can you strip the tags before matching (using the strip_tags function)?
function hashtags_get_tags($text) {
$text = strip_tags($text);
$tags_list = array();
$pattern = "/#[0-9A-Za-z_]+/";
preg_match_all($pattern, $text, $tags_list);
$result = implode(',', $tags_list[0]);
return $result;
}
A regular expression is going to be tricky if you want to only match hashtags that are not inside an HTML tag.

You could throw out the tags beforehand using preg_replace:
function hashtags_get_tags($text) {
$tags_list = array();
$pattern = "/#[0-9A-Za-z_]+/";
$text=preg_replace("/<[^>]*>/","",$text);
preg_match_all($pattern, $text, $tags_list);
$result = implode(',', $tags_list[0]);
return $result;
}

I made this function using PHP DOM.
It returns all links that do not have # in the href.
If you want it to only remove internal hash tags, replace this line:
if(strpos($link->getAttribute('href'), '#') === false) {
with this:
if(strpos($link->getAttribute('href'), '#') !== 0) {
This is the function:
function no_hashtags($text) {
$doc = new DOMDocument();
$doc->loadHTML($text);
$links = $doc->getElementsByTagName('a');
$nohashes = array();
foreach($links as $link) {
if(strpos($link->getAttribute('href'), '#') === false) {
$temp = new DOMDocument();
$elem = $temp->importNode($link->cloneNode(true), true);
$temp->appendChild($elem);
$nohashes[] = $temp->saveHTML();
}
}
// return $nohashes;
return implode('', $nohashes);
// return implode(',', $nohashes);
}
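If the goal is the original one (collect hashtags, but never from inside a tag), the DOM can also be used to look at text nodes only, since attribute values such as href="#top" are not text nodes. A rough sketch; the function name and the sample markup are made up, and the hashtag pattern is the one from the question:

```php
<?php
// Hypothetical helper: comma-separated hashtags found in text nodes only.
function hashtagsFromTextNodes(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a div so it always has a single root element
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $tags = [];
    foreach ($xpath->query('//text()') as $textNode) {
        // Same pattern as in the question
        if (preg_match_all('/#[0-9A-Za-z_]+/', $textNode->nodeValue, $m)) {
            $tags = array_merge($tags, $m[0]);
        }
    }
    return implode(',', $tags);
}

echo hashtagsFromTextNodes('<p>Read <a href="#top">the intro</a> about #php and #regex</p>');
// #php,#regex
```

Compared with stripping tags first, this also decodes entities before matching, so something like &#39; in the text is not picked up as a hashtag.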
