How to extract a single link from a webpage using PHP?

I'm looking for a solution to extract only one URL from a specific webpage using PHP.
Here's a simple example of what I need:
I have a page with many links (https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details).
I want to scrape the link behind the "click here" anchor on that page.
Then the code must return this result https://download.apkpure.com/b/XAPK/Y29tLnhpYW9taS5zbWFydGhvbWVfNjMwNjdfYWU1M2FmOWU?_fn=TWkgSG9tZV92NS44LjdfYXBrcHVyZS5jb20ueGFwaw&as=4c5e64f6f957edac834f3631fe4e09715f2e35f6&ai=-1070628217&at=1596863870&_sa=ai%2Cat&k=24cb20f95fbf333deb01c145ce7b982b5f30d87e&_p=Y29tLnhpYW9taS5zbWFydGhvbWU&c=1%7CLIFESTYLE%7CZGV2PVhpYW9taSUyMEluYy4mdD14YXBrJnM9MTI5OTAzMTM4JnZuPTUuOC43JnZjPTYzMDY3.
I tried this:
$sourceURL="https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details";
$htmlSource=htmlentities(file_get_contents($sourceURL));
echo strip_tags($htmlSource, "<a>");
This gives me all the links, including the one I need.
I need your help to extract the href value of just the link I want.
Thanks in advance.

If you look at the required URL, you can see that every "Click Here" URL starts with https://download.apkpure.com, so we can use a regex to find it.
preg_match_all fills $match with every string that matches the regex; $match[0] holds the full matches, and implode joins them into a single string.
Here is the complete working code:
$sourceURL="https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details";
$content=file_get_contents($sourceURL);
$content = strip_tags($content,"<a>");
preg_match_all('#\bhttps?://download\.apkpure\.com[^,\s()<>]+(?:\([\w\d]+\)|([^,[:punct:]\s]|/))#', $content, $match);
echo implode(', ', $match[0]);
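Since only one link is wanted, preg_match (which stops at the first match) may be enough. The following is a minimal sketch, not the answerer's code: the $content string is made-up stand-in markup so the snippet runs without a network request, the dots in the host name are escaped so they only match literal dots, and the URL is assumed to end at a quote, whitespace, or angle bracket.

```php
<?php
// Stand-in markup (an assumption), not the real page source.
$content = '<a href="/x">x</a> <a href="https://download.apkpure.com/b/XAPK/abc?_fn=def&amp;k=1">click here</a>';

$url = null;
// Escaped dots match literally; the character class stops at quotes, whitespace, < and >.
if (preg_match('#https?://download\.apkpure\.com[^"\'\s<>]+#', $content, $match)) {
    // Attribute values are HTML-escaped in page source, so turn &amp; back into &.
    $url = html_entity_decode($match[0]);
}

echo $url, PHP_EOL;
```

If the page layout changes so that several download.apkpure.com URLs appear, this returns only the first one found.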

The most elegant way is to use a DOM parser:
Iterate through the anchors
Check whether the anchor's id is 'download_link' (which is the id of the 'click here' link)
Extract the href attribute value
$html = file_get_contents('https://apkpure.com/mi-home/com.xiaomi.smarthome/download?from=details');
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$href = '';
foreach($doc->getElementsByTagName('a') as $item) {
    if($item->getAttribute('id') == 'download_link') {
        $href = $item->getAttribute('href');
        break;
    }
}
echo $href;
https://download.apkpure.com/b/XAPK/Y29tLnhpYW9taS5zbWFydGhvbWVfNjMwNjdfYWU1M2FmOWU?_fn=TWkgSG9tZV92NS44LjdfYXBrcHVyZS5jb20ueGFwaw&as=6a7de2cb660007a32e4b3d61a0d3c41e5f2e7102&ai=1946881098&at=1596878986&_sa=ai%2Cat&k=9e912b1007d50d2be9af8e78bcdea86c5f31138a&_p=Y29tLnhpYW9taS5zbWFydGhvbWU&c=1%7CLIFESTYLE%7CZGV2PVhpYW9taSUyMEluYy4mdD14YXBrJnM9MTI5OTAzMTM4JnZuPTUuOC43JnZjPTYzMDY3
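The same lookup can also be written as a single XPath query instead of a loop. This is a rough sketch under assumptions: the anchor is assumed to still carry the id download_link, the helper name findHrefById is made up for illustration, and the HTML string is stand-in markup so the snippet runs without fetching the page.

```php
<?php
// Hypothetical helper: returns the href of the first <a> with the given id, or null.
function findHrefById(string $html, string $id): ?string
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // $id is assumed not to contain a double quote.
    $nodes = $xpath->query('//a[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('href');
}

// Stand-in markup (an assumption), not the real page.
$html = '<html><body><p><a href="/other">other</a> '
      . '<a id="download_link" href="https://download.apkpure.com/b/XAPK/example">click here</a></p></body></html>';

echo findHrefById($html, 'download_link'), PHP_EOL;
```

Returning null when nothing matches lets the caller tell "link not found" apart from an empty href.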

Related

How to use regex for when link contains a specific string? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Grabbing the href attribute of an A element
I need to parse all links of an HTML document that contain some word (it's always different).
Example:
BLA
BLA
BLA
I only need the links with "href=/link: ...." what's the best way to go for it?
$html = "SOME HTML ";
$dom = new DomDocument();
@$dom->loadHTML($html);
$urls = $dom->getElementsByTagName('a');
foreach ($urls as $url)
{
echo "<br> {$url->getAttribute('href')} , {$url->getAttribute('title')}";
echo "<hr><br>";
}
In this example all links are shown, I need specific links.

By using a condition.
<?php
$lookfor='/link:';
foreach ($urls as $url){
if(substr($url->getAttribute('href'),0,strlen($lookfor))==$lookfor){
echo "<br> ".$url->getAttribute('href')." , ".$url->getAttribute('title');
echo "<hr><br>";
}
}
?>

Instead of first fetching all the a elements and then filtering out the ones you need you can query your document for those nodes directly by using XPath:
//a[contains(@href, "link:")]
This query will find all a elements in the document which contain the string link: in the href attribute.
To check whether the href attribute starts with link: you can do
//a[starts-with(@href, "link:")]
Full example (demo):
$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[contains(@href, "link:")]') as $a) {
    echo $a->getAttribute('href'), PHP_EOL;
}
Please also see
Implementing condition in XPath
excluding URLs from path links?
PHP/XPath: find text node that "starts with" a particular string?
PHP Xpath : get all href values that contain needle
for related questions.
Note: marking this CW because of the many related questions
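For the question's "/link:" prefix, the starts-with variant might look like the sketch below. The sample markup, its href and title values, and the $matches array are all made up for illustration, since the question's example links did not survive.

```php
<?php
// Stand-in document; hrefs and titles are assumptions.
$html = '<p><a href="/link:one" title="One">A</a> '
      . '<a href="/other" title="Two">B</a> '
      . '<a href="/link:three" title="Three">C</a></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect title => href for anchors whose href begins with "/link:".
$matches = [];
foreach ($xpath->query('//a[starts-with(@href, "/link:")]') as $a) {
    $matches[$a->getAttribute('title')] = $a->getAttribute('href');
}

print_r($matches);
```

Here only the first and third anchors are selected; the "/other" link is never visited by the loop.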

Use regular expressions.
foreach ($urls as $url)
{
    $href = $url->getAttribute('href');
    if (preg_match("/^\/link:/", $href)) {
        $links[$url->getAttribute('title')] = $href;
    }
}
The $links array then contains all of the titles and hrefs that match.

As getAttribute simply returns a string, you only need to check what it starts with, using strpos().
$href = $url->getAttribute('href');
if (strpos($href, '/link:') === 0)
{
    // Do your processing here
}
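On PHP 8.0 and later the same prefix check can be expressed with the built-in str_starts_with(). A small sketch follows; the wrapper name hrefStartsWith is hypothetical, and it falls back to the strpos() comparison on older versions.

```php
<?php
// Hypothetical helper: true when $href begins with $prefix.
function hrefStartsWith(string $href, string $prefix): bool
{
    if (function_exists('str_starts_with')) {
        return str_starts_with($href, $prefix); // available from PHP 8.0
    }
    return strpos($href, $prefix) === 0;        // fallback for older PHP
}

var_dump(hrefStartsWith('/link:abc', '/link:'));
var_dump(hrefStartsWith('/other/link:abc', '/link:'));
```

The strict === 0 comparison in the fallback matters: strpos() returns false when the prefix is absent, and a loose comparison would treat that the same as position 0.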

How to use preg_match_all search if HTML source contains given URL?

I want to find all a tags whose href includes my URL in any HTML source.
I used this code:
preg_match_all("'<a.*?href=\"(http[s]*://[^>\"]*?)\"[^>]*?>(.*?)</a>'si", $target_source, $matches);
For example, I am trying to find the a tags whose href includes http://www.emrekadan.com
How can I do it?

I'd simply use PHP's DOM parser for this purpose. This may seem harder than regex, but it's actually a lot easier and is the correct way to parse HTML.
$url = 'WEBSITE_TO_SEARCH_FOR';
$searchstring = 'YOUR_SEARCH_STRING';
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$result = array();
foreach($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    if(stripos($href, $searchstring) !== FALSE) {
        $result[] = $href;
    }
}
if(!empty($result)) print_r($result);
Explanation:
Loads the given URL using the loadHTMLFile() method
Finds all <a> tags and loops through them
Uses stripos() to case-insensitively check if the href contains the given search term
If it does, it's pushed into the $result array
Note: If an empty string is passed as the filename or an empty file is named, a warning will be generated. I've used @ to hide that message, but it's generally regarded as bad practice. You can add additional checks to make sure the URL exists before trying to load it.
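One possible way to avoid the error-suppression operator is libxml_use_internal_errors(), which collects parser warnings instead of printing them. The sketch below is an illustration under assumptions, not the answerer's code: the helper name findLinksContaining is made up, and it parses an HTML string rather than a URL so that it runs without a network request.

```php
<?php
// Hypothetical helper: returns every href that contains $needle (case-insensitive).
function findLinksContaining(string $html, string $needle): array
{
    $result = [];
    if ($html === '' || $needle === '') {
        return $result; // nothing to parse or nothing to search for
    }

    $previous = libxml_use_internal_errors(true); // buffer parser warnings instead of emitting them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (stripos($href, $needle) !== false) {
            $result[] = $href;
        }
    }
    return $result;
}

// Stand-in markup (an assumption).
$html = '<p><a href="http://www.emrekadan.com/page">x</a> '
      . '<a href="http://example.org/">y</a> '
      . '<a href="HTTP://WWW.EMREKADAN.COM/">z</a></p>';

print_r(findLinksContaining($html, 'www.emrekadan.com'));
```

The early return for empty input is the kind of additional check the note above mentions.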

How do I cut this string in PHP?

I'm currently working on a script to archive an imageboard.
I'm kinda stuck on making links reference correctly, so I could use some help.
I receive this string:
>>10028949<br><br>who that guy???
In said string, I need to alter this part:
<a href="10028949#p10028949"
to become this:
<a href="#p10028949"
using PHP.
This part may appear more than once in the string, or might not appear at all.
I'd really appreciate it if you had a code snippet I could use for this purpose.
Thanks in advance!
Kenny

Disclaimer: as the comments will no doubt point out, using a DOM parser is the better way to parse HTML.
That being said:
"/(<a[^>]*?href=")\d+(#[^"]+")/"
replaced by $1$2
So...
$myString = preg_replace("/(<a[^>]*?href=\")\d+(#[^\"]+\")/", "$1$2", $myString);
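As a quick check, the replacement can be run against a sample string. The input below is an assumption reconstructed from the fragments in the question; in particular the class attribute and the link text are made up.

```php
<?php
// Sample input (assumed); only the href="10028949#p10028949" part comes from the question.
$myString = '<a href="10028949#p10028949" class="quotelink">&gt;&gt;10028949</a><br><br>who that guy???';

// $1 keeps everything up to href=", \d+ drops the leading post number, $2 keeps the #fragment and closing quote.
$myString = preg_replace('/(<a[^>]*?href=")\d+(#[^"]+")/', '$1$2', $myString);

echo $myString, PHP_EOL;

// A string without such a link passes through unchanged.
$plain = preg_replace('/(<a[^>]*?href=")\d+(#[^"]+")/', '$1$2', 'no links here');
```

Because preg_replace replaces every match by default, the same call also covers strings in which the pattern appears more than once.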

try this
>>10028949<br><br>who that guy???

Although your question has already been answered, I invite you to see what would (approximately xD) be the correct approach: parsing it with DOM.
$string = '>>10028949<br><br>who that guy???';
$dom = new DOMDocument();
$dom->loadHTML($string);
$links = $dom->getElementsByTagName('a'); // This stores all the links in a list (actually a DOMNodeList object)
foreach($links as $link){
    $href = $link->getAttribute('href'); //getting the href
    $cut = strpos($href, '#');
    $new_href = substr($href, $cut); //cutting the string at the #
    $link->setAttribute('href', $new_href); //setting the good href
}
$body = $dom->getElementsByTagName('body')->item(0); //selecting everything
$output = $dom->saveHTML($body); //passing it into a string
echo $output;
The advantages of doing it this way are:
More organized / cleaner
Easier for others to read
You could, for example, have mixed links and only want to modify some of them. Using DOM you can select certain classes only
You can change other attributes as well, or the selected tag's siblings, parents, children, etc.
Of course you could achieve the last two points with regex as well, but it would be a complete mess...
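To illustrate the point about selecting certain classes only, here is a hedged sketch that rewrites the href only on anchors with a given class. The class name quotelink and the sample markup are assumptions, not taken from the question.

```php
<?php
// Sample markup (assumed): one quote link and one unrelated link.
$string = '<p><a href="10028949#p10028949" class="quotelink">&gt;&gt;10028949</a> '
        . '<a href="123#top">other</a></p>';

$dom = new DOMDocument();
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);

// Match "quotelink" as a whole word inside the class attribute.
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " quotelink ")]';
foreach ($xpath->query($query) as $link) {
    $href = $link->getAttribute('href');
    $cut = strpos($href, '#');
    if ($cut !== false) {
        $link->setAttribute('href', substr($href, $cut)); // keep only the #fragment
    }
}

// Collect the resulting hrefs to see which ones changed.
$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}
print_r($hrefs);
```

The second link keeps its original href because it lacks the class, which is the kind of selectivity that becomes awkward with a single regex.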

Replace an element with Dom Document PHP

I load an HTML page with PHP's DOMDocument:
$doc = new DOMDocument();
@$doc->loadHTMLFile($url);
I search my page for all "a" elements, and if one meets my condition I need to replace the whole link with just its text: for example, a link whose text is My link is beautiful becomes the plain text My link is beautiful.
Here is my loop:
$liens = $div->getElementsByTagName('a');
foreach($liens as $lien){
if($lien->hasAttribute('href')){
if (preg_match("/metz2/i", $lien->getAttribute('href'))) {
//HERE I NEED TO REPLACE </a>
}
$cpt++;
}
}
Do you have any ideas? Suggestions? Thanks :)

Every time I need to manipulate the DOM with PHP, I use a library called PHP Simple HTML DOM Parser.
It's very easy to use; something like this might work for you:
// Create DOM from URL or file
$html = file_get_html('http://www.page.com/');
// Find all links
foreach($html->find('a') as $element) {
    // Do your custom logic here if you need it; for example, this takes the inner contents of the a tag and puts them in place of the whole tag.
    $inner = $element->innertext;
    $element->outertext = $inner;
}
//To echo modified html again:
echo $html;

Could be done with preg_replace as well:
$sText = 'Stackoverflow';
$sText = preg_replace( '/<a\b[^>]*>(.*?)<\/a>/', '$1', $sText );
echo $sText;
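The replacement can also be done with DOMDocument itself, without an extra library, by swapping each matching anchor for a text node. This is a sketch under assumptions: the sample markup and URLs are made up, and the metz2 condition is taken from the question's loop.

```php
<?php
// Stand-in markup (an assumption).
$html = '<div><a href="http://example.com/metz2/page">My link is beautiful</a> and '
      . '<a href="http://example.com/other">kept</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Copy the live node list into an array first: replacing nodes while
// iterating the live list would cause elements to be skipped.
$liens = iterator_to_array($doc->getElementsByTagName('a'));
foreach ($liens as $lien) {
    if (preg_match('/metz2/i', $lien->getAttribute('href'))) {
        // Swap the <a> element for a text node holding its text content.
        $texte = $doc->createTextNode($lien->textContent);
        $lien->parentNode->replaceChild($texte, $lien);
    }
}

$div = $doc->getElementsByTagName('div')->item(0);
$output = $doc->saveHTML($div);
echo $output, PHP_EOL;
```

Note that textContent flattens any markup nested inside the link, so this variant suits links that contain plain text only.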
