Cropping hyperlinks from a block of text with PHP

I have the following HTML on my webpage:
<p>This is a <a href="http://www.google.com/">hyperlink</a> and this is another <a href="http://www.bing.com/">hyperlink</a>. There are many like it, but <a href="http://en.wikipedia.org/wiki/Full_Metal_Jacket">this one is mine</a>.</p>
Now, I was wondering...
Is there any way I can use a PHP function to split this block of text up into an array?
$html[0] = "<p>This is a & this is another . There are many like it, but .</p>";
$html[1] = "http://www.google.com/";
$html[2] = "http://www.bing.com/";
$html[3] = "http://en.wikipedia.org/wiki/Full_Metal_Jacket";
So, basically stripping the initial block of text of all hyperlinks and storing them all in their own array element.
Many thanks for any help with this.

Use this regex to get the URLs out of the HTML:
<?php
$url = "http://www.example.net/somepage.html";
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $input, $matches)) {
    // $matches[2] = array of link addresses
    // $matches[3] = array of link text - including HTML code
}
?>
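To go from those matches to the array layout asked for in the question, one option (a sketch, assuming markup like the example above) is to collect the hrefs and then strip the anchor tags out of the text with the same pattern:
<?php
$input  = "<p>This is a <a href=\"http://www.google.com/\">hyperlink</a> and this is another <a href=\"http://www.bing.com/\">hyperlink</a>. There are many like it, but <a href=\"http://en.wikipedia.org/wiki/Full_Metal_Jacket\">this one is mine</a>.</p>";
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

$html = array();
if (preg_match_all("/$regexp/siU", $input, $matches)) {
    // Element 0: the original text with every <a>...</a> removed
    $html[0] = preg_replace("/$regexp/siU", '', $input);
    // Elements 1..n: the link addresses, in document order
    foreach ($matches[2] as $href) {
        $html[] = $href;
    }
}
print_r($html);
?>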

Related

How to extract a single link in a webpage using PHP?

I'm looking for a solution to extract only one URL from a specific webpage using PHP.
Here's a simple example of what I need:
I have a URL with many links (https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details)
I want to scrape the link under the 'click here' anchor on that page.
Then the code must return this result https://download.apkpure.com/b/XAPK/Y29tLnhpYW9taS5zbWFydGhvbWVfNjMwNjdfYWU1M2FmOWU?_fn=TWkgSG9tZV92NS44LjdfYXBrcHVyZS5jb20ueGFwaw&as=4c5e64f6f957edac834f3631fe4e09715f2e35f6&ai=-1070628217&at=1596863870&_sa=ai%2Cat&k=24cb20f95fbf333deb01c145ce7b982b5f30d87e&_p=Y29tLnhpYW9taS5zbWFydGhvbWU&c=1%7CLIFESTYLE%7CZGV2PVhpYW9taSUyMEluYy4mdD14YXBrJnM9MTI5OTAzMTM4JnZuPTUuOC43JnZjPTYzMDY3.
I tried this:
$sourceURL="https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details";
$htmlSource=htmlentities(file_get_contents($sourceURL));
echo strip_tags($htmlSource, "<a>");
I get a result with all the links, including the one I need.
I need your help to extract the href value of the link I want.
Thanks in advance.
If you look at the required URL, you can see that every 'Click here' URL starts with the pattern https://download.apkpure.com, so we can use a regex to find it.
preg_match_all returns an array of the strings that match our regex; implode then converts the first index into a string.
Here is the complete working code:
$sourceURL="https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details";
$content=file_get_contents($sourceURL);
$content = strip_tags($content,"<a>");
preg_match_all('#\bhttps?://download\.apkpure\.com[^,\s()<>]+(?:\([\w\d]+\)|([^,[:punct:]\s]|/))#', $content, $match);
echo implode(', ', $match[0]);
The most elegant way is to use a DOM parser:
Iterate through the anchors
Check whether the anchor's ID is 'download_link' (the ID used on the 'click here' link)
Extract the href attribute value
$html = file_get_contents('https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details');
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$href = '';
foreach ($doc->getElementsByTagName('a') as $item) {
    if ($item->getAttribute('id') == 'download_link') {
        $href = $item->getAttribute('href');
        break;
    }
}
echo $href;
https://download.apkpure.com/b/XAPK/Y29tLnhpYW9taS5zbWFydGhvbWVfNjMwNjdfYWU1M2FmOWU?_fn=TWkgSG9tZV92NS44LjdfYXBrcHVyZS5jb20ueGFwaw&as=6a7de2cb660007a32e4b3d61a0d3c41e5f2e7102&ai=1946881098&at=1596878986&_sa=ai%2Cat&k=9e912b1007d50d2be9af8e78bcdea86c5f31138a&_p=Y29tLnhpYW9taS5zbWFydGhvbWU&c=1%7CLIFESTYLE%7CZGV2PVhpYW9taSUyMEluYy4mdD14YXBrJnM9MTI5OTAzMTM4JnZuPTUuOC43JnZjPTYzMDY3

Get Webpage Title from URL in PHP not working

I'm trying to receive a certain URL through POST and scrape the title of that HTML page. Then, I will store the title of the page in my MySQL database.
Before Implementing this feature to my actual online server, I tested the page_title function (which is the custom function that reads the title of the HTML page of a given URL) on my local server, and it worked fine. Here is the code I used on my local server.
<?php
$link = $_POST['link'];
function page_title($url) {
    $fp = file_get_contents($url);
    if (!$fp)
        return null;
    $res = preg_match("/<title>(.*)<\/title>/siU", $fp, $title_matches);
    if (!$res)
        return null;
    // Clean up title: remove EOL's and excessive whitespace.
    $title = preg_replace('/\s+/', ' ', $title_matches[1]);
    $title = trim($title);
    return $title;
}
$title= page_title($link);
echo $title; ?>
However, when I used this exact same code on my online server to actually push the data into the MySQL database, the function seems to return nothing but an empty string. As a result, whenever I check phpMyAdmin, nothing appears in the "title" column. Can anyone please tell me what I can do to make this work? Thank you!
I suggest simplifying it by doing this (feel free to remove the comments; they are only there to explain each step):
<?php
# Get the HTML from a web page
$html = file_get_contents("http://whatever.url");
# Match the <title> tag (this is your own regex)
$res = preg_match("/<title>(.*)<\/title>/siU", $html, $titleArray);
# Get the first array entry - or an empty string if the tag does not exist
$title = isset($titleArray[0]) ? $titleArray[0] : "";
# Remove HTML tags from the string
$title = strip_tags($title);
# Show the title - htmlentities() is only used here to prove there are no tags left
echo "[". htmlentities($title) ."]";
# Save it to your database ...
?>
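As for why the same code returns an empty string on the live server: a common cause is that allow_url_fopen is disabled there, or the remote site refuses requests that don't send a browser-like User-Agent, so file_get_contents() quietly returns false. A hedged sketch of a cURL-based fetch (the function name fetch_html is just for illustration) that you could use in place of file_get_contents():
<?php
// Fetch a URL with cURL instead of file_get_contents()
// (works even when allow_url_fopen is off on the host).
function fetch_html($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // some sites block the default PHP agent
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}
?>
Then $html = fetch_html("http://whatever.url"); can replace the file_get_contents() call above.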

preg_replace with wildcards?

I have HTML markup bearing the form
<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>
which I need to replace to bear the form
<div id='abcd1234AN'><p id='wxyz1234AN'>Hello</p></div>
where N may be 1, 2, ...
The best I have been able to do is as follows:
function cloneIt($a, $b)
{
    return substr_replace($a, $b, -1);
}
$ndx = "1'";
$str = "<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>";
preg_match_all("/id='[a-z]{4}[0-9]{4}A'/",$str,$matches);
$matches = $matches[0];
$reps = array_merge($matches);
$ndxs = array_fill(0,count($reps),$ndx);
$reps = array_map("cloneIt",$reps,$ndxs);
$str = str_replace($matches,$reps,$str);
echo htmlspecialchars($str);
which works just fine. However, my regex skills are not much to write home about, so I suspect that there is probably a better way to do this. I'd be most obliged to anyone who might be able to suggest a neater/quicker way of accomplishing the same result.
You can optimize your regex like this:
/id='[a-z]{4}\d{4}A'/
Sample code
preg_match_all("/id='[a-z]{4}\\d{4}A'/",$str,$matches);
However, an alternative would be to use an HTML parser. Here I'll use Simple HTML DOM:
// Requires the Simple HTML DOM library:
include 'simple_html_dom.php';

// $ndx holds the numeric suffix to append, e.g.
$ndx = 1;

// Load the HTML from URL or file
$html = file_get_html('http://www.mysite.com/');
// You can also load $html from a string: $html = str_get_html($my_string);

// Find divs that have an id attribute
foreach ($html->find('div[id]') as $div) {
    // $div->id is the attribute value itself, e.g. "abcd1234A"
    if (preg_match("/^[a-z]{4}\\d{4}A$/", $div->id)) {
        $div->id = $div->id . $ndx;  // append the suffix: abcd1234A -> abcd1234A1
    }
}
echo $html->save();
Did you notice how elegant, concise and clear the code becomes with an HTML parser?
References
Simple Html Dom Documentation

Modification to a code to merge two parts of it with similar characteristics

Below is a link crawler that collects the URLs of a page down to a given depth. At the end of it I added a regular expression to match all the email addresses on the page that was just crawled. As you can see in the second part, it calls file_get_contents on the same page it has just downloaded, meaning twice the execution time, bandwidth, etc.
The question is: how can I merge those two parts so they use the first downloaded page and avoid fetching it again? Thank you.
function crawler($url, $depth = 2) {
    $dom = new DOMDocument('1.0');
    if (!$parts || !@$dom->loadHTMLFile($url)) {
        return;
    }
    .
    .
    .
    // this is where the second part starts
    $text = file_get_contents($url);
    $res = preg_match_all("/[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}/i", $text, $matches);
}
Replace:
$text = file_get_contents($url);
with:
$text = $dom->saveHTML();
http://www.php.net/manual/en/domdocument.savehtml.php
Alternatively, in the first part of your function, you could save the HTML into a variable using file_get_contents, then pass it to $dom->loadHTML. That way you can then reuse the variable with your regex.
http://www.php.net/manual/en/domdocument.loadhtml.php
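A rough sketch of that alternative (the crawl logic itself stays in the elided part, and the $parts check from the original is omitted here):
function crawler($url, $depth = 2) {
    // Download the page once and keep the raw HTML around
    $html = @file_get_contents($url);
    if ($html === false) {
        return;
    }

    $dom = new DOMDocument('1.0');
    if (!@$dom->loadHTML($html)) {
        return;
    }

    // ... crawl the links found in $dom as before ...

    // Reuse the already-downloaded HTML for the email regex
    preg_match_all("/[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}/i", $html, $matches);
}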

Scrape unique image URLs from HTML

Using PHP to curl a web page (some URL entered by user, let's assume it's valid).
Example: http://www.youtube.com/watch?v=Hovbx6rvBaA
I need to parse the HTML and extract all de-duplicated URLs that look like an image: not just the ones in img src="", but any URL on that page ending in jpe?g|bmp|gif|png, etc. (In other words, I don't want to parse the DOM; I want to use a regex.)
I plan to then curl the URLs for their width and height information and ensure that they are indeed images, so don't worry about security related stuff.
What's wrong with using the DOM? It gives you much better control over the context of the information and a much higher likelihood that the things you pull out are actually URLs.
<?php
$resultFromCurl = '
<html>
<body>
<img src="hello.jpg" />
Yep
<table background="yep.jpg">
</table>
<p>
Perhaps you should check out foo.jpg! I promise it
is safe for work.
</p>
</body>
</html>
';
// These are all the attributes I could think of that
// can contain URLs.
$queries = array(
    '//table/@background',
    '//img/@src',
    '//input/@src',
    '//a/@href',
    '//area/@href',
    '//img/@longdesc',
);
$dom = new DOMDocument();
@$dom->loadHTML($resultFromCurl);
$xpath = new DOMXPath($dom);
$urls = array();
foreach ($queries as $query) {
    foreach ($xpath->query($query) as $link) {
        if (preg_match('#\.(gif|jpe?g|png)$#', $link->textContent)) {
            $urls[$link->textContent] = true;
        }
    }
}
if (preg_match_all('#\b[^\s]+\.(?:gif|jpe?g|png)\b#', $dom->textContent, $matches)) {
    foreach ($matches[0] as $m) {
        $urls[$m] = true;
    }
}
$urls = array_keys($urls);
var_dump($urls);
Collect all image URLs into an array, then use array_unique() to remove duplicates.
$my_image_links = array_unique( $my_image_links );
// No more duplicates
If you really want to do this with a regex, we can assume each image name will be surrounded by quotes (' or "), spaces, tabs, line breaks, the beginning of the line, >, <, or whatever else you can think of. So we can do:
$pattern = '/[\'" >\t^]([^\'" \n\r\t]+\.(jpe?g|bmp|gif|png))[\'" <\n\r\t]/i';
preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);
$imgs = array_unique($matches[1]);
The above will capture the image link in stuff like:
<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG</p>
Live example
