Convert all Relative urls to Absolute urls while maintaining contents - php

I am scraping site data using Simple HTML DOM, but I have a problem converting relative URLs to absolute URLs. Imagine the direct page link is http://www.example.com/tutorial.html; the content I extract contains relative links, and I want all of them to be absolute. For example:
$string = "<p>this is text within string</p> and more random strings which contains link like <a href='docs/555text.fileextension'>Download this file</a> <p>Other html follows where another relative link may exist like <a href='files/doc.doc'>This file</a>";
I want to get something like:
$string = "<p>this is text within string</p> and more random strings which contains link like <a href='http://www.example.com/docs/555text.fileextension'>Download this file</a> <p>Other html follows where another relative link may exist like <a href='http://www.example.com/files/doc.doc'>This file</a>";
That is, I just want to convert all relative URLs to absolute URLs while keeping the rest of $string intact.
When I try the solution given below, it does not work for the real scraped data:
// This is real data from the scraped HTML
//Base URL is http://www.zoomtanzania.com/
// [^>]* matches zero or more characters other than >
$regex = '~<a([^>]*)href=\'([^\']*)\'([^>]*)>~';
// replacement for each subpattern (3 in total)
// basically here we are adding missing baseurl to href
$replace = '<a$1href="http://www.zoomtanzania.com/$2"$3>';
$string =
'<div style="background-color: rgba(255, 255, 255, 0.8);">
<div style="font-size:17px; font-weight:bold; ">
MECHANICAL TECHNICIAN</div>
<hr style="margin:4px">
<div>
<p class="pull-right">Application Deadline: 24 Jul 2015<br></p>
<h5>Mechanical Technician POSITION DESCRIPTION:</h5><br>
Position Description Document (download)
<br>
<br>
<h5>APPLICATION INSTRUCTIONS:</h5><br>
<p>
All applications should be sent to the address below or via <strong>APPLY NOW</strong> below before 24th July 2015.</p>
<p>Eligible candidates are required to submit detailed CV with names of three referees and an application letter.</p>
<p>
<br>P.O.BOX 4955,<br>Dar es Salaam,</p>
<p>Tanzania.</p>
<br>
<br>
</div>
</div>';
$replaced = preg_replace($regex, $replace, $string);
echo $replaced;
//Method does not replace Position Description Document (download) to Position Description Document (download)

You were right to use preg_replace. For your example, you can try this code:
// [^>]* matches zero or more characters other than >
// single quote AND double quote support
$regex = '~<a([^>]*)href=["\']([^"\']*)["\']([^>]*)>~';
// replacement for each subpattern (3 in total)
// basically here we are adding missing baseurl to href
$replace = '<a$1href="http://www.example.com/$2"$3>';
$string = "<p>this is text within string</p> and more random strings which contains link like <a href='docs/555text.fileextension'>Download this file</a> <p>Other html follows where another relative link may exist like <a href='files/doc.doc'>This file</a>";
$replaced = preg_replace($regex, $replace, $string);
Result
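<p>this is text within string</p> and more random strings which contains link like <a href="http://www.example.com/docs/555text.fileextension">Download this file</a> <p>Other html follows where another relative link may exist like <a href="http://www.example.com/files/doc.doc">This file</a>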
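Note that the pattern above prefixes every href it finds, including links that are already absolute. The following is a minimal sketch of one way to skip those, not part of the original answer: the helper name absolutize_hrefs() is hypothetical, and it assumes every relative link should hang directly off the base URL (it does not resolve ../ segments or paths relative to a subdirectory).

```php
<?php
// Hypothetical helper (an assumption, not from the answer): prefix only relative hrefs.
function absolutize_hrefs(string $html, string $base): string
{
    // Group 2 is the opening quote; the backreference \2 requires the same closing quote.
    $regex = '~<a\b([^>]*?)href=(["\'])(.*?)\2([^>]*)>~i';

    return preg_replace_callback($regex, function (array $m) use ($base): string {
        $url = $m[3];
        // Leave scheme URLs (http:, mailto:, ...), protocol-relative URLs and fragments untouched.
        if (preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
            return $m[0];
        }
        // Join base and path with exactly one slash between them.
        $absolute = rtrim($base, '/') . '/' . ltrim($url, '/');
        return '<a' . $m[1] . 'href="' . $absolute . '"' . $m[4] . '>';
    }, $html);
}

$string = "<a href='docs/555text.fileextension'>Download this file</a> "
    . '<a href="http://other.example/page">Already absolute</a>';
echo absolutize_hrefs($string, 'http://www.example.com/');
```

Using preg_replace_callback instead of preg_replace makes it possible to inspect each URL before deciding whether to rewrite it. For anything beyond simple markup, a DOM parser is likely to be more robust than a regex.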

Related

PHP can't take <img> tag from page

I have a problem with the PHP preg_match function.
In the DLE CMS, I am trying to extract an image from a news item (image-x), but in the module I refer to it via a direct link.
//remove <p></p> tags
$row[$i]['short_story'] = str_replace( "</p><p>", " ",$row[$i]['short_story'] );
//remove the \" escapes (DLE put it in the MySQL column)
$row[$i]['short_story'] = str_replace("\\\"", " ", $row[$i]['short_story']);
//remove all tags except <img>, but there remains a simple text that is stored without tags
$row[$i]['img'] = strip_tags($row[$i]['short_story'], "<img>");
//try to find <img> (by '>'), to remove the simple text;
preg_match(".*>", $row[$i]['img'], $matches);
// print only <br/> (matches is empty)
print_r($matches."<br/>\n");
For example, print_r($row[$i]['img']) outputs:
<img src="somelink" class="fr-fic" fr-dib="" alt=""> Some text
And I need only:
<img src="somelink" class="fr-fic" fr-dib="" alt="">
Your regex pattern for selecting <img> is incorrect: it lacks regex delimiters. Use /<img[^>]+>/ as the pattern instead. The code should change to:
preg_match("/<img[^>]+>/", $row[$i]['img'], $matches);
You can also use preg_replace() to remove the additional text after <img>:
preg_replace("/(<img[^>]+>)[\w\s]+/", "$1", $string)
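As a minimal sketch of the corrected call, run on the sample value from the question (the variable names $img and $tag are illustrative, not from DLE):

```php
<?php
// Sample value taken from the question; in DLE it would come from $row[$i]['img'].
$img = '<img src="somelink" class="fr-fic" fr-dib="" alt=""> Some text';

// preg_match() returns 1 on a match and fills $matches; $matches[0] is the whole match.
$tag = null;
if (preg_match('/<img[^>]+>/i', $img, $matches)) {
    $tag = $matches[0];
    echo $tag, "\n";
} else {
    echo "no <img> tag found\n";
}
```

Because $matches is an array, individual elements such as $matches[0] should be printed rather than concatenating the array itself with a string.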

how to do echo from a string, only from values that are between a specific stretch[href tag] of the string?

[PHP] I have a variable that stores a very large page source as a string. I want to echo only the interesting values (dozens of them, which I need to extract for a project); they are inside the quotation marks of the href attribute,
but I only want to capture the values that start with the letter n (news):
[<a href="/news7044449/exclusive_news_sunday_"]
<a href="/n[ews7044449/exclusive_news_sunday_]"
That is, I think the match will have to be anchored on: [a href="/n]
How can I make the echo drop all the other text in the variable, showing only those values?
Note that there are other href attributes whose values start with other letters, such as the letter 'P': href="/profiles... (These do not interest me.)
$string = '</div><span class="news-hd-mark">HD</span></div><p>exclusive_news_sunday_</p><p class="metadata"><span class="bg">Czech AV<span class="mobile-hide"> - 5.4M Views</span>
- <span class="duration">7 min</span></span></p></div><script>xv.thumbs.preparenews(7044449);</script>
<div id="news_31720715" class="thumb-block "><div class="thumb-inside"><div class="thumb"><a href="/news31720715/my_sister_running_every_single_morning"><img src="https://static-hw.xnewss.com/img/lightbox/lightbox-blank.gif"';
I imagine something like this:
$removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n = ('/something regex expresion I think /' or preg_match, substring?);
echo $string = str_replace($removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n,'',$string);
expected output: /news7044449/exclusive_news_sunday_
NOTE: the source does not have to be a variable; the values could just as well be extracted from a .txt file.
thanks.
I believe this will help.
<?php
$source = file_get_contents("code.html");
preg_match_all("/<a href=\"(\/n(?:.+?))\"[^>]*>/", $source, $results);
var_export( end($results) );
To get just the links out of the $results array from Valdeir's answer:
// $results[1] holds the captured href values ($results[0] holds the full matches)
foreach ($results[1] as $r) {
echo $r . "\n";
// alt: to display them with an HTML break tag after each one
// echo $r . "<br>\n";
}
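For a self-contained illustration, here is a minimal sketch run on a shortened sample instead of code.html. The sample markup is an assumption: the news link comes from the question, while the /photos/1 link is made up to show a value that must be skipped.

```php
<?php
// Shortened sample; a real page would be loaded with file_get_contents().
$source = '<p>exclusive_news_sunday_</p>'
    . '<a href="/photos/1">made-up link that should be ignored</a>'
    . '<a href="/news31720715/my_sister_running_every_single_morning"><img src="x.gif"></a>';

// Group 1 is the href value; [^"]* keeps the match inside the quotation marks.
preg_match_all('~<a\s[^>]*href="(/n[^"]*)"~i', $source, $results);

// preg_match_all() fills $results[0] with full matches and $results[1] with group 1.
foreach ($results[1] as $href) {
    echo $href, "\n";
}
```

Using [^"]* rather than .+? means the capture can never run past the closing quotation mark, even when several links sit on one line.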

PHP cut text from a specific word in an HTML string

I would like to cut all text (image alt included) in an HTML string from a specific word onward.
For example, this is the string:
<?php
$string = '<div><img src="img.jpg" alt="cut this text from here" />cut this text from here</div>';
?>
and this is what I would like to output
<div>
<a href="#">
<img src="img.jpg" alt="cut this text" />
cut this text
</a>
</div>
The $string is actually an element of an object, but I didn't want to post overly long code here.
Obviously I can't use explode, because that would kill the HTML markup.
str_replace and substr are also out, because the length of the text before and after the cut word is not constant.
So what can I do to achieve this?
OK, I solved my problem, and I'm posting an answer to my own question because it could help someone.
This is what I did:
<?php
// The cut word must be spelled "from" here, because explode() below splits on ' from'.
$string = '<div><img src="img.jpg" alt="cut this text from here" />cut this text from here</div>';
$txt_only = strip_tags($string);
$explode = explode(' from', $txt_only);
$find_txt = array(' from', $explode[1]);
$new_str = str_replace($find_txt, '', $string);
echo $new_str;
?>
This might not be the best solution, but it was quick and did not involve DOM parsing.
If anybody wants to try this, make sure that your href, src, or any other attribute that needs to stay untouched doesn't contain any of the strings in $find_txt, or it will be altered too.
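For comparison, here is a minimal sketch of a DOM-based variant, which is a different technique from the str_replace approach above. It assumes the dom extension is available and that only text nodes and alt attributes should be cut; the variable names $word and $new_str are illustrative.

```php
<?php
$string = '<div><img src="img.jpg" alt="cut this text from here" />cut this text from here</div>';
$word = ' from';

$doc = new DOMDocument();
// The flags keep libxml from wrapping the fragment in <html><body> and adding a doctype.
$doc->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($doc);
// Visit every text node and every alt attribute; src, href and other attributes are never touched.
foreach ($xpath->query('//text() | //@alt') as $node) {
    $pos = strpos($node->nodeValue, $word);
    if ($pos !== false) {
        // Keep only the part before the cut word.
        $node->nodeValue = substr($node->nodeValue, 0, $pos);
    }
}

$new_str = trim($doc->saveHTML());
echo $new_str;
```

Because the cut is applied per node, a src or href that happens to contain the same word is left alone, which addresses the caveat mentioned above.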

SIMPLE HTML DOM - how to ignore nested elements?

My html code is as follows
<span class="phone">
i want this text
<span class="ignore-this-one">01234567890</span>
<span class="ignore-this-two" >01234567890</span>
<a class="also-ignore-me">some text</a>
</span>
What I want to do is extract the 'i want this text' leaving all of the other elements behind. I've tried several iterations of the following, but none return the text I need:
$name = trim($page->find('span[class!=ignore^] a[class!=also^] span[class=phone]',0)->innertext);
Some guidance would be appreciated as the simple_html_dom section on filters is quite bare.
What about using PHP's preg_match (http://php.net/manual/en/function.preg-match.php)?
Try the below:
<?php
$html = <<<EOF
<span class="phone">
i want this text
<span class="ignore-this-one">01234567890</span>
<span class="ignore-this-two" >01234567890</span>
<a class="also-ignore-me">some text</a>
</span>;
EOF;
$result = preg_match('#class="phone".*\n(.*)#', $html, $matches);
echo $matches[1];
?>
regex explained:
Find the text class="phone", then proceed to the end of the line, matching any character with .*. Then move to the next line with \n and grab everything on that line by enclosing .* in parentheses.
The result is stored in the array $matches: $matches[0] holds the text matched by the whole regex, while $matches[1] holds the text captured by the parentheses.
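The regex depends on the wanted text sitting on the line directly after the opening tag. As an alternative, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath (not simple_html_dom, and not part of the answer above) to read only the span's own text nodes; it assumes the dom extension is available.

```php
<?php
$html = <<<EOF
<span class="phone">
i want this text
<span class="ignore-this-one">01234567890</span>
<span class="ignore-this-two" >01234567890</span>
<a class="also-ignore-me">some text</a>
</span>
EOF;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// text() selects only the direct text-node children of the span,
// so text inside the nested <span> and <a> elements is skipped.
$text = '';
foreach ($xpath->query('//span[@class="phone"]/text()') as $node) {
    $text .= $node->nodeValue;
}
$text = trim($text);
echo $text;
```

The trim() call removes the line breaks that surround the wanted text and the whitespace left between the ignored child elements.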

How do I scrape information between 2 tags?

I am trying to scrape information with PHP that has their data like so:
<br>1998 - <a href="http://example.com/movie/id/2345">A Night at the Roxburry<a/>
I need to get the year that is between the <br> and the <a> tag. I have gotten the title of the movie by using PHP Simple DOM HTML parser. This was the code that I used to parse the title
foreach($dom->getElementsByTagName('a') as $link){
$title = $link->getAttribute('href');
}
I tried using:
$string = '<br>1998 - <a href="http://example.com/movie/id/2345">A Night at the Roxburry<a/>';
$year = preg_match_all('/<br>(.*)<a>', $string);
But it's not finding the year that is in between the <br> and the <a> tag. Does anyone know what I could possibly do to find the year?
Try this:
<?php
$subject = '<br>1998 - <a href="http://site.com/movie/id/2345">A Night at the Roxburry<a/>';
$pattern = '/<br>[0-9]{4}/';
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
?>
Note that you can change the pattern if the year appears in another format. If you want everything between the two tags, you can use $pattern = '/<br>.*<a/'; or any other pattern that suits you.
The expression you are using, $year = preg_match_all('/<br>(.*)<a>', $string);, looks for text between <br> and <a>, but your example does not contain <a> anywhere (the pattern is also missing its closing / delimiter). Try looking for text between <br> and <a like this:
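preg_match_all('/<br>([^<]*)<a/', $string, $matches);
// preg_match_all() returns the number of matches; the captured text is in $matches[1][0], here "1998 - "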
Note that I also changed . to [^<] to make sure the match stops at the next tag; otherwise it will match strings like this:
<br>foo<br><br>1998 - <a href="http://site.com/movie/id/2345">A Night at the Roxburry<a
because they start with <br> and end with <a, but this is probably not what you need, and your year will be like this:
foo<br><br>1998 - <a href="http://site.com/movie/id/2345">A Night at the Roxburry
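To get the year as a value of its own, a capture group limited to four digits can be used. This is a minimal sketch, assuming the year always directly follows <br> and has exactly four digits:

```php
<?php
$string = '<br>1998 - <a href="http://example.com/movie/id/2345">A Night at the Roxburry<a/>';

$year = null;
// Group 1 captures four digits that follow <br> (optional whitespace is allowed in between).
if (preg_match('/<br>\s*(\d{4})\b/', $string, $matches)) {
    // $matches[0] is the whole match including <br>; $matches[1] is just the digits.
    $year = (int) $matches[1];
    echo $year;
}
```

preg_match stops at the first match, which is enough for a single entry; preg_match_all with the same pattern would collect the years from a list of such lines into $matches[1].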
