For example, I've got a string like this:
$html = '
test
test
test
hi
';
and I want to prepend the absolute URL to every href that doesn't already include an absolute domain:
$html = '
test
test
test
hi
';
What's the best way to do that? I guess something with regex, but my regex skills are ** ;)
Thanks in advance!
Found a good way:
$html = preg_replace("#(<\s*a\s+[^>]*href\s*=\s*[\"'])(?!http)([^\"'>]+)([\"'>]+)#", '$1http://mydomain.com/$2$3', $html);
You can use (?!http|mailto) instead if your $html also contains mailto links.
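To illustrate the lookahead variant, here is a minimal sketch. The helper name `prefix_relative_hrefs`, the sample markup, and the base URL are all hypothetical; the pattern is the one from the answer above with `mailto` added to the lookahead and the `i` flag set.

```php
<?php
// Hypothetical helper: prefix every href that does not already start with
// "http" (which also covers https) or "mailto". $base is an assumed placeholder.
function prefix_relative_hrefs(string $html, string $base): string
{
    $pattern = "#(<\s*a\s+[^>]*href\s*=\s*[\"'])(?!http|mailto)([^\"'>]+)([\"'>]+)#i";

    // A callback avoids surprises if $base ever contains "$" or "\" sequences.
    return preg_replace_callback($pattern, function (array $m) use ($base): string {
        return $m[1] . $base . $m[2] . $m[3];
    }, $html);
}

$html = '<a href="page.html">a</a> <a href="mailto:me@example.com">c</a>';
echo prefix_relative_hrefs($html, 'http://mydomain.com/'), "\n";
```

Only the first link is rewritten; the mailto link is skipped by the negative lookahead.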
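$domain = 'http://mydomain';
preg_match_all('/href="(.*?)"/', $html, $matches);
// array_unique() keeps a link that occurs twice from being prefixed twice
foreach (array_unique($matches[1]) as $link) {
    if (substr($link, 0, 4) != 'http') {
        // replace the whole attribute so other text containing $link is left alone
        $html = str_replace('href="' . $link . '"', 'href="' . $domain . $link . '"', $html);
    }
}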
The previous answer will cause problems with your first and fourth examples because it fails to include a forward slash to separate the domain from the page name. Admittedly this can be fixed by simply appending a slash to $domain, but if you do that then href="/something.php" will end up with two.
Just to give an alternative Regex solution you could go with something like this...
$pattern = '#(?<=href=")(.+?)(?=")#';
$output = preg_replace_callback($pattern, 'make_absolute', $input);
function make_absolute($link) {
    $domain = 'http://domain.com';
    if (strpos($link[1], 'http') !== 0) {
        if (strpos($link[1], '/') !== 0) {
            return $domain . '/' . $link[1];
        } else {
            return $domain . $link[1];
        }
    }
    return $link[1];
}
However, it is worth noting that a link such as href="example.html" is relative to the current directory, so neither method shown so far will work correctly for relative links on pages that aren't in the root directory. A solution that handles those would need more information about where the HTML came from.
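If the URL of the page the HTML came from is known, directory-relative links can be handled too. The following is only a sketch under that assumption: `resolve_link` and the example URLs are hypothetical, and it deliberately ignores ports, `../` segments, and query-only links.

```php
<?php
// Hypothetical helper: resolve $link against the URL of the page it came from.
// Covers absolute, protocol-relative, root-relative and directory-relative links.
function resolve_link(string $link, string $pageUrl): string
{
    // Already absolute: starts with a scheme such as http:, https: or mailto:
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }

    $parts  = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    // Protocol-relative (//host/path): reuse the page's scheme.
    if (strpos($link, '//') === 0) {
        return $parts['scheme'] . ':' . $link;
    }
    // Root-relative (/path): attach to the origin.
    if (strpos($link, '/') === 0) {
        return $origin . $link;
    }
    // Directory-relative: keep the page path up to and including its last slash.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $link;
}

echo resolve_link('example.html', 'http://domain.com/articles/2012/index.html'), "\n";
```

A function like this could be called from the `preg_replace_callback` callback in place of the fixed `$domain` concatenation.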
Related
I'm trying to replace a title tag from |title|Page title| to <title>Page Title</title>, using this regular expression. But being a complete amateur, it's not gone too well.
'^|title|^[a-zA-Z0-9_]{1,}|$' => '<title>$1</title>'
I would love to know how to fix it, and more importantly, what I did wrong and why it was wrong.
You almost got it:
You should escape the | characters, as they have a special meaning in a regex and you are using them as plain characters.
You should add the space character to your search group.
$string = '|title|Page title|';
$pattern = '/\|title\|([a-zA-Z0-9_ ]{1,})\|/';
$replacement = '<title>$1</title>';
echo preg_replace($pattern, $replacement, $string); //echoes <title>Page title</title>
OP posted some code in comments which is wrong, try this version:
$regular_expressions = array(
    array('/\|title\|([a-zA-Z0-9_ ]{1,})\|/', '<title>$1</title>'),
);
foreach ($regular_expressions as $regexp) {
    $data = preg_replace($regexp[0], $regexp[1], $data);
}
Here's a little function I came up with a while back to scrape the title of a page when users submitted links through my service. It fetches the contents of the given URL, looks for a title tag and, if one is found, returns whatever is between the tags. With a little tweaking I'm sure you can turn it into a replace for whatever you're doing. So this is more of a starting point than an answer, but I hope it helps to some extent.
$url = 'http://www.chrishacia.com';
function get_page_title($url) {
    // give up if the page cannot be fetched
    if (!($data = file_get_contents($url))) return false;
    // U makes the match ungreedy, i makes it case-insensitive
    if (preg_match("#<title>(.+)<\/title>#iU", $data, $t)) {
        return trim($t[1]);
    } else {
        return false;
    }
}
var_dump(get_page_title($url));
<?php
$s = "|title|Page title|";
$s = preg_replace('/^\|title\|([^\|]+)\|/', "<title>$1</title>", $s);
echo $s;
?>
I am trying to collect all the images on a page using XPath. I iterate through the node list, check whether each image has attributes, and if it does, loop through them until I get to src. My problem is relative paths like /us/english/images/12/something.jpeg. Is there a way to get the full path?
I thought of running a regex on the returned src to look for a host and, if there isn't one, using the site's URL, but that can be hard to check for.
I also thought I could parse the URL and check the ['host'] part: if it contains a dot, there is a host and I shouldn't add one?
Here is what I have so far:
$image_list = $xpath->query('//img');
foreach ($image_list as $element) {
    if ($element->hasAttributes()) {
        foreach ($element->attributes as $attribute) {
            if (strtolower($attribute->nodeName) == 'src') {
                echo $attribute->nodeName . ' = ' . $attribute->nodeValue . '<br>';
            }
        }
    }
}
Would appreciate any help.
Change your XPath query to //img[@src]. This returns only the img elements that have a src attribute. Then use the getAttribute method; your code will be shorter and more efficient.
$image_list = $xpath->query("//img[@src]");
for ($i = 0; $i < $image_list->length; $i++) {
    echo "src = " . $image_list->item($i)->getAttribute("src") . "\n";
}
As for the relative paths problem, look for the base element's href attribute. If it's found, use it as the base URI for relative URLs. If it's not found, use the URL of the document itself as the base URI.
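The base-URI lookup described above might be sketched like this. `find_base_uri`, the sample markup, and the example.com URLs are assumptions for illustration; the final concatenation is deliberately naive and only correct for directory-relative src values with a base that ends in a slash.

```php
<?php
// Hypothetical helper: prefer <base href="...">, otherwise fall back to the
// URL the document was fetched from.
function find_base_uri(DOMXPath $xpath, string $documentUrl): string
{
    $base = $xpath->query('//base[@href]')->item(0);
    return $base !== null ? $base->getAttribute('href') : $documentUrl;
}

$doc = new DOMDocument();
// @ silences libxml warnings about imperfect real-world markup
@$doc->loadHTML(
    '<html><head><base href="http://example.com/shop/"></head>'
    . '<body><img src="images/a.jpg"><img alt="no source"></body></html>'
);
$xpath = new DOMXPath($doc);
$base  = find_base_uri($xpath, 'http://example.com/index.html');

// //img[@src] skips the second image, which has no src attribute
foreach ($xpath->query('//img[@src]') as $img) {
    echo $base . $img->getAttribute('src'), "\n";
}
```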
Update
Since you want to read the image file path out of a complex URL like
//lp.hm.com/hmprod?set=key[source],value[/environment/2012/P01_2972_044R_0.jpg]&set=key[rotate],value[0.65]&set=key[width],value[2921]&set=key[height],value[3415]&set=key[x],value[1508]&set=key[y],value[495]&set=key[type],value[FASHION_FRONT]&call=url[file:/product/large]
you'd be better off with a custom parser like this:
$url = $image_list->item($i)->getAttribute("src");
$q = strpos($url, "?");
$query = substr($url, $q + 1);
$params = explode("&", html_entity_decode($query));
$data = array();
foreach ($params as $e) {
    // set=key[name],value[val] pairs
    if (preg_match("/key\[([^\]]+)\],value\[([^\]]+)\]/", $e, $m)) {
        $data[$m[1]] = $m[2];
    // call=name[val]
    } elseif (preg_match("/call=([^\[]+)\[([^\]]+)\]/", $e, $m)) {
        $data[$m[1]] = $m[2];
    }
}
print_r($data);
I have a PHP function which takes a passed url and creates a clean link. It puts the full link in the anchor tags and presents just "www.domain.com" from the url. It works well but I would like to modify it so it strips out the "www." part as well.
<?php
// pass a url like: http://www.yelp.com/biz/my-business-name
// should return: yelp.com
function formatURL($url, $target=FALSE) {
if ($target) { $anchor_tag = "\\4"; }
else { $anchor_tag = "\\4"; }
$return_link = preg_replace("`(http|ftp)+(s)?:(//)((\w|\.|\-|_)+)(/)?(\S+)?`i", $anchor_tag, $url);
return $return_link;
}
?>
My regex skills are not that strong so any help greatly appreciated.
Take a look at parse_url: http://us2.php.net/manual/en/function.parse-url.php
This will simplify your logic quite a bit and can make removing the www. a simple string operation.
$link = 'http://www.yelp.com/biz/my-business-name';
$hostname = parse_url($link, PHP_URL_HOST);
if (strpos($hostname, 'www.') === 0)
{
    $hostname = substr($hostname, 4);
}
I have modified my original answer to account for the issue in the comments. The preg_replace in the post below will also work and is a bit more concise; I will leave this here to show an alternative that does not invoke the regex engine, if you prefer that.
This will get you the domain name minus the www.:
$url = preg_replace('/^www\./', '', parse_url($url, PHP_URL_HOST));
The ^ in the regex means www. is only removed from the start of the string; the dot is escaped so it matches a literal dot rather than any character.
Working example : http://codepad.org/FTNikw8g
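Both answers assume the URL actually has a host. A small sketch of the same idea with that case guarded follows; `display_host` is a hypothetical name, and returning null for host-less input is a design assumption, not something from the original question.

```php
<?php
// Hypothetical helper: host name without a leading "www.", or null when the
// URL has no host component (e.g. it lacks a scheme) or cannot be parsed.
function display_host(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // parse_url() gave null (no host) or false (malformed URL)
    }
    // Anchored and escaped: only a literal "www." at the very start is removed.
    return preg_replace('/^www\./i', '', $host);
}

echo display_host('http://www.yelp.com/biz/my-business-name'), "\n";
```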
Do I do something wrong?
I need the youtube code, but it doesn't return the real value.
if(preg_match_all("http:\/\/www\.youtube\.com\/v\/(.*)(.*)", $row->n_texto, $matches){
$code = $image_to_thumb .= "http://i1.ytimg.com/vi/".$matches[1][0]."/0.jpg";
}
Edit - ircmaxell Based on the comment, the link structure in the text is:
http:// www.youtube.com/v/plMvAh10HVg%26hl=en%26fs=1%26rel=0
Update
The problem is: my code return a link like this:
http://www.youtube.com/v/plMvAh10HVg%26hl=en%26fs=1%26rel=0
Can I make the regexp stop before %26hl=en%26fs=1%26rel=0 appears?
Your regex is not correct. There are more than a few things wrong with it. Now, as far as what you want, try this:
#http://(?:.*)youtube.com/v/([^/\#?]+)#
Now, as for why, let's look at the regex:
http://(?:.*)youtube.com
You're looking for a string that starts with http://, followed by anything (www., ww2., or nothing) before youtube.com.
/v/
You're looking for /v/ as the start of the path.
([^/\#?]+)
You're looking for everything else up to another /, a query string (?), or an anchor (#). So that should match the ID you're looking for.
So, it would be
if(preg_match("#http://(?:.*)youtube.com/v/([^/\#?]+)#", $row->n_texto, $matches)){
    $code = $image_to_thumb .= "http://i1.ytimg.com/vi/".$matches[1]."/0.jpg";
}
If you wanted to find all:
if(preg_match_all("#http://(?:.*)youtube.com/v/([^/\#?]+)#", $row->n_texto, $matches)){
    foreach ($matches[1] as $match) {
        $code = $image_to_thumb .= "http://i1.ytimg.com/vi/".$match."/0.jpg";
    }
}
The link provided has a space before the first w in www.youtube.com. The code you need is:
if(preg_match_all("%http://www\.youtube\.com/v/([\w]+)%i", $row->n_texto , $matches)){
$code = $image_to_thumb .= "http://i1.ytimg.com/vi/".$matches[1][0]."/0.jpg";
}
Also, the URL you have is encoded (%26 is an encoded &), so you may want to run urldecode($row->n_texto) before matching against it.
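Putting the decoding and the ID capture together, a sketch could look like the following. `youtube_thumbnails` is a hypothetical helper, and the character class `[\w-]` is an assumption about which characters an ID may contain; the thumbnail URL format is the one used in the question.

```php
<?php
// Hypothetical helper: collect thumbnail URLs for every youtube.com/v/<id>
// link found in a block of text.
function youtube_thumbnails(string $text): array
{
    // %26 becomes &, so the query-like tail is clearly separated from the ID
    $decoded = urldecode($text);

    // [\w-]+ stops at the first character that cannot be part of the ID
    preg_match_all('#https?://(?:www\.)?youtube\.com/v/([\w-]+)#i', $decoded, $matches);

    return array_map(function (string $id): string {
        return "http://i1.ytimg.com/vi/{$id}/0.jpg";
    }, $matches[1]);
}

print_r(youtube_thumbnails('See http://www.youtube.com/v/plMvAh10HVg%26hl=en%26fs=1%26rel=0'));
```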
^http://\w{0,3}.?youtube+\.\w{2,3}/watch\?v=[\w-]{11}
according to http://www.regexlib.com/REDetails.aspx?regexp_id=2569
I've run into a hard problem to deal with. I am replacing a-tags and img-tags to fit my suggestions like this. So far so good.
$search = array('|(<a\s*[^>]*href=[\'"]?)|', '|(<img\s*[^>]*src=[\'"]?)|');
$replace = array('\1proxy2.php?url=', '\1'.$url.'/');
$new_content = preg_replace($search, $replace, $content);
Now my problem is that there are links on pages that i fetch the content of that looks like this:
<a href="/test/page/">
and
<a href="http://google.se/test/">
After replacing, these two links look like this:
<a href="proxy2.php?url=/test/page/">
and
<a href="proxy2.php?url=http://google.se/test/">
The problem for me is that I want to insert a variable named $url before /test/page/, and only on links like that one, not on those that already started with http:// or https://.
This should do the job for the anchor tags, at least:
<?php
function prepend_proxy($matches) {
    $url = 'http://example.prefix';
    // keep the link's own scheme if it has one, otherwise prepend $url
    $prepend = $matches[2] ? $matches[2] : $url;
    $prepend = 'proxy2.php?url=' . $prepend;
    return $matches[1] . $prepend . $matches[3];
}
$new_content = preg_replace_callback(
    '|(href=[\'"]?)(https?://)?([^\'"\s]+[\'"]?)|i',
    'prepend_proxy',
    $content
);
?>
Simply make your proxy2.php a little smarter. If a fully qualified URL comes in (http://...), redirect to that. If a local URL comes in (e.g. /test/page/), drop in what's missing (e.g. http://www.mylittleapp.com/test/page/) and redirect.
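The "smarter proxy2.php" idea could be sketched roughly as below. The function name `proxy_target` and the way it is wired to `$_GET['url']` are assumptions; a real script would also want to validate the target before redirecting.

```php
<?php
// Hypothetical helper for proxy2.php: work out where a requested URL should go.
function proxy_target(string $requested, string $siteRoot): string
{
    // Fully qualified URL: use it unchanged.
    if (preg_match('#^https?://#i', $requested)) {
        return $requested;
    }
    // Local URL: drop in the missing site root, with exactly one slash between.
    return rtrim($siteRoot, '/') . '/' . ltrim($requested, '/');
}

// Inside proxy2.php this might be used roughly as:
// header('Location: ' . proxy_target($_GET['url'] ?? '/', 'http://www.mylittleapp.com'));

echo proxy_target('/test/page/', 'http://www.mylittleapp.com'), "\n";
```

With this in place, the original search-and-replace that prefixes every href with proxy2.php?url= can stay as it is.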
This would do the trick
$search = array('#(<a\s*[^>]*href=[\'"]?)(https?://)?#');
$replace = array('\1proxy2.php?url=');
$new_content = preg_replace($search, $replace, $content);
Result:
<a href="proxy2.php?url=/test/page/">
<a href="proxy2.php?url=google.se/test/">
It's me, Sara. Scronide, your code didn't work. It still returns:
<a href="proxy2.php?url=/test/page/">
<a href="proxy2.php?url=google.se/test/">
What I wanted instead is this, with the URL prepended:
<a href="proxy2.php?url=**THEURLHERE.COM**/test/page/">
<a href="proxy2.php?url=google.se/test/">
SORRY, IT DID WORK, I WAS DOING SOMETHING WRONG WITH THE URL VARIABLE. THANK YOU SCRONIDE!