Let's say I have the following string:
<?php
$str = 'To subscribe go to Here';
?>
What I'm trying to do is find the URLS within the string that have a specific domain name, "foo.com" for this example, then append the url.
What I want to accomplish:
<?php
$str = 'To subscribe go to Here';
?>
If the domain name in the urls isn't foo.com, I don't want them to be appended.
You can use parse_url() function and the DomDoccument class of php to manipulate the urls, like this:
$str = 'To subscribe go to Here';
$dom = new DomDocument();
$dom->loadHTML($str);
$urls = $dom->getElementsByTagName('a');
foreach ($urls as $url) {
$href = $url->getAttribute('href');
$components = parse_url($href);
if($components['host'] == "foo.com"){
$components['path'] .= "?package=2";
$url->setAttribute('href', $components['scheme'] . "://" . $components['host'] . $components['path']);
}
$str = $dom->saveHtml();
}
echo $str;
Output:
To subscribe go to [Here]
^ href="http://foo.com/subscribe?package=2"
Here are the references:
The DOMDocument class
parse_url()
Related
I have used the following code to replace all the links on HTML page.
$output = file_get_contents($turl);
$newOutput = str_replace('href="http', 'target="_parent" href="hhttp://localhost/e/site.php?turl=http', $output);
$newOutput = str_replace('href="www.', 'target="_parent" href="http://localhost/e/site.php?turl=www.', $newOutput);
$newOutput = str_replace('href="/', 'target="_parent" href="http://localhost/e/site.php?turl='.$turl.'/', $newOutput);
echo $newOutput;
I want to modify this code to replace only links inside the body and not in the head.
You can use DOMDocument to parse and manipulate the source. It's always a better idea to use a dedicated parser for a task like this instead of using string operations.
// Parse the HTML into a document
$dom = new \DOMDocument();
$dom->loadXML($html);
// Loop over all links within the `<body>` element
foreach($dom->getElementsByTagName('body')[0]->getElementsByTagName('a') as $link) {
// Save the existing link
$oldLink = $link->getAttribute('href');
// Set the new target attribute
$link->setAttribute('target', "_parent");
// Prefix the link with the new URL
$link->setAttribute('href', "http://localhost/e/site.php?turl=" . urlencode($oldLink));
}
// Output the result
echo $dom->saveHtml();
See https://eval.in/843484
You can decapitate the code.
Finds the body and separate the head from the body to two variables.
//$output = file_get_contents($turl);
$output = "<head> blablabla
Bla bla
</head>
<body>
Foobar
</body>";
//Decapitation
$head = substr($output, 0, strpos($output, "<body>"));
$body = substr($output, strpos($output, "<body>"));
// Find body tag and parse body and head to each variable
$newOutput = str_replace('href="http', 'target="_parent" href="hhttp://localhost/e/site.php?turl=http', $body);
$newOutput = str_replace('href="www.', 'target="_parent" href="http://localhost/e/site.php?turl=www.', $newOutput);
$newOutput = str_replace('href="/', 'target="_parent" href="http://localhost/e/site.php?turl='.$turl.'/', $newOutput);
echo $head . $newOutput;
https://3v4l.org/WYcYP
Tried to research what I'm doing wrong here, but no luck so far. I want to pull the links and URLs in this MRSS feed using this script, but it's not working. Thought all I needed to do was use namespaces to get the child elements out, but no luck:
<?php
$html = "";
$url = "http://feeds.nascar.com/feeds/video?command=search_videos&media_delivery=http&custom_fields=adtitle%2cfranchise&page_size=100&sort_by=PUBLISH_DATE:DESC&token=217e0d96-bd4a-4451-88ec-404debfaf425&any=franchise:%20Preview%20Show&any=franchise:%20Weekend%20Top%205&any=franchise:Up%20to%20Speed&any=franchise:Press%20Pass&any=franchise:Sprint%20Cup%20Practice%20Clips&any=franchise:Sprint%20Cup%20Highlights&any=franchise:Sprint%20Cup%20Final%20Laps&any=franchise:Sprint%20Cup%20Victory%20Lane&any=franchise:Sprint%20Cup%20Post%20Race%20Reactions&any=franchise:All%20Access&any=franchise:Nationwide%20Series%20Qualifying%20Clips&any=franchise:Nationwide%20Series%20Highlights&any=franchise:Nationwide%20Series%20Final%20Laps&any=franchise:Nationwide%20Series%20Victory%20Lane&any=franchise:Nationwide%20Series%20Post%20Race%20Reactions&any=franchise:Truck%20Series%20Qualifying%20Clips&any=franchise:Truck%20Series%20Highlights&any=franchise:Truck%20Series%20Final%20Laps&any=franchise:Truck%20Series%20Victory%20Lane&any=franchise:Truck%20Series%20Post%20Race%20Reactions&output=mrss";
$xml = simplexml_load_file($url);
$namespaces = $xml->getNamespaces(true); // get namespaces
for($i = 0; $i < 50; $i++){ // will return the 50 most recent videos
$title = $xml->channel->item[$i]->title;
$link = $xml->channel->item[$i]->link;
$pubDate = $xml->channel->item[$i]->pubDate;
$description = $xml->channel->item[$i]->description;
$titleid = $xml->channel->item[$i]->children($namespaces['bc'])->titleid;
$url = $xml->channel->item[$i]->children($namespaces['media'])->url;
$html .= "<h3>$title</h3>$description<p>$pubDate<p>$link<p>Video ID: $titleid<p>
<iframe width='480' height='270' src='http://link.brightcove.com/services/player/bcpid3742068445001?bckey=//my API token goes here &bctid=$titleid&autoStart=false' frameborder='0'></iframe><hr/>";/* this embed code is from the youtube iframe embed code format but is actually using the embedded Ooyala player embedded on the Campus Insiders page. I replaced any specific guid (aka video ID) numbers with the "$guid" variable while keeping the Campus Insider Ooyala publisher ID, "eb3......fad" */
}
echo $html;
?>
I take it this isn't the right approach:
$url = $xml->channel->item[$i]->children($namespaces['media'])->url;
What am I doing wrong here?
Thanks for any and all help!
MD
SimpleXML is deceptively named as it is more difficult to use than DOMDocument or the other PHP XML extensions. To get the URL, you'll need to access the url attribute of the media:content node:
<media:content duration="95" medium="video" type="video/mp4"
url="http://brightcove.meta.nascar.com.edgesuite.net/vod/etc"/>
Target the first <media:content> node using
$xml->channel->item[$i]->children($namespaces['media'])->content[0]
and get its attributes:
$m_attrs =
$xml->channel->item[$i]->children($namespaces['media'])->content[0]->attributes();
You can then access the url attribute:
echo "URL: " . $m_attrs["url"] . "\n";
Your code should thus be:
$titleid = $xml->channel->item[$i]->children($namespaces['bc'])->titleid;
$m_attrs = $xml->channel->item[$i]->children($namespaces['media'])->content[0]->attributes();
$url = $m_attrs["url"];
$html .= "<h3>$title</h3>$description<p>$pubDate<p>$link<p>Video ID: $titleid<p> (etc.)";
I have a small piece of code that checks a string for a url and adds the < a href> tag to create a link. I also have it check the string for a youtube link and then add rel="youtube" to the < a> tag.
How can I get the code to only add rel to the youtube links?
How can I get it to add a different rel to any type of image link?
$text = "http://site.com a site www.anothersite.com/ http://www.youtube.com/watch?v=UyxqmghxS6M here is another site";
$linkstring = preg_replace( '/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', '\4', $text );
if(preg_match('/http:\/\/www\.youtube\.com\/watch\?v=[^&]+/', $linkstring, $vresult)) {
$linkstring = preg_replace( '/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', '<a rel="youtube" href="\0">\4</a>', $text );
$type= 'youtube';
}
else {
$type = 'none';
}
echo $text;
echo $linkstring, "<br />";
echo $type, "<br />";
Try http://simplehtmldom.sourceforge.net/.
Code:
<?php
include('simple_html_dom.php');
$html = str_get_html('Link');
$html->find('a', 0)->rel = 'youtube';
echo $html;
Output:
[username#localhost dom]$ php dom.php
Link
You can build an entire page DOM or a simple single link with this library.
Detecting hostname of URL:
Pass the url to parse_url. parse_url returns an array of the URL parts.
Code:
print_r(parse_url('http://www.youtube.com/watch?v=UyxqmghxS6M'));
Output:
Array
(
[scheme] => http
[host] => www.youtube.com
[path] => /watch
[query] => v=UyxqmghxS6M
)
Try the following:
//text
$text = "http://site.com/bounty.png a site www.anothersite.com/ http://www.youtube.com/watch?v=UyxqmghxS6M&featured=true here is another site";
//Youtube links
$pattern = "/(http:\/\/){0,1}(www\.){0,1}youtube\.com\/watch\?v=([a-z0-9\-_\|]{11})[^\s]*/i";
$replacement = '<a rel="youtube" href="http://www.youtube.com/watch?v=\3">\0</a>';
$text = preg_replace($pattern, $replacement, $text);
//image links
$pattern = "/(http:\/\/){0,1}(www\.){0,1}[^\/]+\/[^\s]+\.(png|jpg|jpeg|bmp|gif)[^\s]*/i";
$replacement = '<a rel="image" href="\0">\0</a>';
$text = preg_replace($pattern, $replacement, $text);
note that the latter can only detect links to images which have an extension. As such, links like www.example.com?image=3 will not be detected.
i have this code
preg_match_all('%(?:youtube\.com/(?:[^/]+/.+/|(?:v|e(?:mbed)?)/|.*[?&]v=)|youtu\.be/)([^"&?/ ]{11})%i', $asd, $match);
to find youtube key of urls like
http://www.youtube.com/watch?v=ZiJRPREeQ1Q
<br />
http://www.youtube.com/watch?v=GaEZgqxPHLs&feature=related
this work good to find the code ZiJRPREeQ1Q and GaEZgqxPHLs , now i want to replace all the html line with a new code
wanna to use
preg_replace
to find the whole youtube url
*
to a new code how can i do that ?
--------------adds--------------
after i get the code of youtube from url by preg_math_all
i used this code to extract the codes
foreach($match[1] as $youtube){
// $youtube; // this handle the youtube code
$match = ""; // what can i write here relative to $youtube ?
$str .= preg_replace($match, 'new code',$content); // $content handle the whole thread that contain the youtube url <a href=*>*</a>
}
the only thing that i need that what's regular expression that i can use to replace youtube code
$html = file_get_contents($url); //or curl function
$re="<link itemprop=\"embedURL\" href=\"(.+)\">";
preg_match_all("/$re/siU", $html, $matches);
$youtube = $matches[1][0];
or
$html = file_get_contents($url); //or curl function
$re="<link itemprop=\"url\" href=\"(.+)\">";
preg_match_all("/$re/siU", $html, $matches);
$youtube = $matches[1][0];
I need a regex that will give me the string inside an href tag and inside the quotes also.
For example i need to extract theurltoget.com in the following:
URL
Additionally, I only want the base url part. I.e. from http://www.mydomain.com/page.html i only want http://www.mydomain.com/
Dont use regex for this. You can use xpath and built in php functions to get what you want:
$xml = simplexml_load_string($myHtml);
$list = $xml->xpath("//#href");
$preparedUrls = array();
foreach($list as $item) {
$item = parse_url($item);
$preparedUrls[] = $item['scheme'] . '://' . $item['host'] . '/';
}
print_r($preparedUrls);
$html = 'URL';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);
echo $info['scheme'].'://'.$info['host']; // http://www.mydomain.com
this expression will handle 3 options:
no quotes
double quotes
single quotes
'/href=["\']?([^"\'>]+)["\']?/'
Use the answer by #Alec if you're only looking for the base url part (the 2nd part of the question by #David)!
$html = 'URL';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);
This will give you:
$info
Array
(
[scheme] => http
[host] => www.mydomain.com
[path] => /page.html" class="myclass" rel="myrel
)
So you can use $href = $info["scheme"] . "://" . $info["host"]
Which gives you:
// http://www.mydomain.com
When you are looking for the entire url between the href, You should be using another regex, for instance the regex provided by #user2520237.
$html = 'URL';
$url = preg_match('/href=["\']?([^"\'>]+)["\']?/', $html, $match);
$info = parse_url($match[1]);
this will give you:
$info
Array
(
[scheme] => http
[host] => www.mydomain.com
[path] => /page.html
)
Now you can use $href = $info["scheme"] . "://" . $info["host"] . $info["path"];
Which gives you:
// http://www.mydomain.com/page.html
http://www.the-art-of-web.com/php/parse-links/
Let's start with the simplest case - a well formatted link with no extra attributes:
/<a href=\"([^\"]*)\">(.*)<\/a>/iU
For all href values replacement:
function replaceHref($html, $replaceStr)
{
$match = array();
$url = preg_match_all('/<a [^>]*href="(.+)"/', $html, $match);
if(count($match))
{
for($j=0; $j<count($match); $j++)
{
$html = str_replace($match[1][$j], $replaceStr.urlencode($match[1][$j]), $html);
}
}
return $html;
}
$replaceStr = "http://affilate.domain.com?cam=1&url=";
$replaceHtml = replaceHref($html, $replaceStr);
echo $replaceHtml;
This will handle the case where there are no quotes around the URL.
/<a [^>]*href="?([^">]+)"?>/
But seriously, do not parse HTML with regex. Use DOM or a proper parsing library.
/href="(https?://[^/]*)/
I think you should be able to handle the rest.
Because Positive and Negative Lookbehind are cool
/(?<=href=\").+(?=\")/
It will match only what you want, without quotation marks
Array (
[0] => theurltoget.com )