How to get links with mp3 as extension - php

I have this code, which extracts all links from a website. How do I edit it so that it only extracts links that end in .mp3?
Here is the code:
preg_match_all("/\<a.+?href=(\"|')(?!javascript:|#)(.+?)(\"|')/i", $html, $matches);

Update:
A nice solution would be to use DOM together with XPath, as @zerkms mentioned in the comments:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new DOMXPath($doc);
// XPath 1.0 (the version DOMXPath supports) has no ends-with(), so compare the last four characters of href instead
$links = $xpath->query('//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href');
Original Answer:
I would use DOM for this:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$links = array();
foreach($doc->getElementsByTagName('a') as $elem) {
if($elem->hasAttribute('href')
&& preg_match('/.*\.mp3$/i', $elem->getAttribute('href'))) {
$links []= $elem->getAttribute('href');
}
}
var_dump($links);

I would prefer XPath, which is meant to parse XML/xHTML:
$DOM = new DOMDocument();
@$DOM->loadHTML($html); // use the @ to suppress warnings from invalid HTML
$XPath = new DOMXPath($DOM);
$links = array();
$link_nodes = $XPath->query('//a[contains(@href, ".mp3")]');
foreach($link_nodes as $link_node) {
$source = $link_node->getAttribute('href');
// do some extra work to make sure .mp3 is at the end of the string
$links[] = $source;
}
XPath 2.0 has an ends-with() function that could replace contains(), but PHP's DOMXPath only supports XPath 1.0. So you need an extra check to make sure the .mp3 is at the end of the string, for example with substring() and string-length() in the query or a string comparison in PHP. It may not be necessary for your pages, though.
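The extra check mentioned above could look like the following minimal sketch. It assumes PHP's DOMXPath (XPath 1.0) and emulates ends-with() with substring() and string-length(). The helper name mp3Links and the sample HTML are made up for illustration, and the comparison is case-sensitive.

```php
<?php
// Hypothetical helper: returns the href values that end in ".mp3" (case-sensitive).
function mp3Links(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // XPath 1.0 stand-in for ends-with(@href, ".mp3"): compare the last 4 characters.
    $query = '//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href';

    $links = [];
    foreach ($xpath->query($query) as $attr) {
        $links[] = $attr->nodeValue;
    }
    return $links;
}

// Made-up sample markup: only the first link should be kept.
$sample = '<html><body><a href="song.mp3">a</a>'
        . '<a href="song.mp3.txt">b</a><a href="#">c</a></body></html>';
$result = mp3Links($sample);
print_r($result);
```

If upper-case extensions such as .MP3 matter, filtering in PHP with a case-insensitive comparison after a broader query is probably simpler than doing it in XPath 1.0.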

Related

PHP Simple HTML Dom Parser code is not working. Output is blank

I was trying to scrape the data from "non-secured" url that is using 'http' instead of 'https'.
Here is the code
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('/html/body/div[1]/div/div[1]/div[3]/div/div[1]/table/tbody/tr[2]/td[1]/h3')->item(0);
return $h3_element->nodeValue;
}
add_shortcode('shortcode_name2', 'display_html_info2');
I have also tried using this XPath:
//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tbody/tr[2]/td[1]/h3
In both cases the output is blank, i.e. there is no value.
Please let me know how to make this work.
I have included html_dom_parser.php.
The code above gives no value as output; a blank space appears where I use the shortcode [shortcode_name2].
Additional
I have tried @Pinke Helga's method, but it does not work for me. This is what I did:
declare(strict_types = 1);
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
if (!is_string($html)) {
return 'Error: Could not retrieve the HTML content.';
}
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3')->item(0);
return $h3_element->nodeValue;
}
echo display_html_info2();
add_shortcode('shortcode_name2', 'display_html_info2');
And this is what I got: "Error: Could not retrieve the HTML content."
It looks as if you generated the XPath expression with the browser dev tools. The browser adds some elements when it parses the HTML; there is no <tbody> in the original source.
Use the XPath expression //*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3
Complete code:
<?php declare(strict_types = 1);
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3')->item(0);
// var_dump($h3_element);
return $h3_element->nodeValue;
}
echo display_html_info2(); // DEBUG output
Current result:
21.898 OMR
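To see the <tbody> effect without fetching the live page, here is a small sketch on made-up markup that mimics raw source (a table without <tbody>). The helper firstText is hypothetical; it also guards against item(0) returning null, which is what makes ->nodeValue fail when nothing matches.

```php
<?php
// Hypothetical helper: text of the first match, or null when nothing matches.
function firstText(DOMXPath $xpath, string $query): ?string
{
    $node = $xpath->query($query)->item(0);
    return $node === null ? null : trim($node->nodeValue);
}

// Made-up markup: like the raw page source, the table has no <tbody>.
$html = '<html><body><div id="myCarousel"><table>'
      . '<tr><td><h3>Rate</h3></td></tr>'
      . '<tr><td><h3>21.898 OMR</h3></td></tr>'
      . '</table></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Path copied from browser dev tools style (with tbody) versus the raw-source path.
$withTbody    = firstText($xpath, '//*[@id="myCarousel"]/table/tbody/tr[2]/td[1]/h3');
$withoutTbody = firstText($xpath, '//*[@id="myCarousel"]/table/tr[2]/td[1]/h3');

var_dump($withTbody);    // no match: the parser does not insert <tbody>
var_dump($withoutTbody);
```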

Get element from giving URL with dom php

Building an MTDB DB with php, and need to scrape a specific tag from the URL.
Tag to get from url
vars.disqus = '';
vars.lists = [];
vars.titleId = '35079';
vars.trailersPlayer = 'default';
vars.userId = '907791';
vars.title = {"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec}
I need the
"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec
My code:
$html = 'myurl';
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile($html);
libxml_clear_errors();
$xp = new DOMXpath($dom);
$nodes = $xp->query('//script[@\'id','trailer','title');
echo $nodes->item(0)->nodeValue;
The "tag" is not HTML; it looks like JavaScript code.
To extract the string, simply use a regex:
preg_match('/title\s*=\s*\{([^}]+)}/', $str, $matches);
var_dump($matches[1]);
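Putting the two pieces together, one possible sketch is to let DOMDocument find the script elements and apply the same regex to their text. The page fragment below is made up around the lines quoted in the question; on the real site the HTML would come from the actual URL.

```php
<?php
// Made-up page fragment containing the inline script lines from the question.
$html = <<<'HTML'
<html><head><script>
vars.titleId = '35079';
vars.title = {"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec}
</script></head><body></body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();

$found = null;
foreach ($dom->getElementsByTagName('script') as $script) {
    // Same pattern as above: capture everything between the braces after "title =".
    if (preg_match('/title\s*=\s*\{([^}]+)}/', $script->textContent, $m)) {
        $found = $m[1];
        break;
    }
}
echo $found, PHP_EOL;
```

Note that the captured text is not valid JSON as quoted in the question (35097.flv and 0.50sec are unquoted), so json_decode() would not parse it without further cleanup.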

Q: How to find a section from a web page without using XPath

I need to extract a section from a web page, using the DOM API but without XPath. This is my current version, which still uses XPath. It needs to extract the "Latest Distributions" section and display the information in the browser.
<?php
$result = file_get_contents ('https://distrowatch.com/');
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($result);
$xpath = new DOMXPath($doc);
$node = $xpath->query('//table[@class="News"]')->item(0);
echo $node->textContent;
This seems pretty straightforward, but it's a waste of time to do this instead of XPath.
<?php
$result = file_get_contents ('https://distrowatch.com/');
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($result);
foreach ($doc->getElementsByTagName("table") as $table) {
if ($table->getAttribute("class") === "News") {
echo $table->textContent;
break;
}
}
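One caveat with the strict comparison above: it only matches when the class attribute is exactly "News". If the element carried several classes, a token check would be needed. The sketch below is an assumption for illustration (the helper hasClass and the sample markup are made up, not taken from the real page):

```php
<?php
// Hypothetical helper: true if the element's class attribute contains the given class token.
function hasClass(DOMElement $el, string $class): bool
{
    $tokens = preg_split('/\s+/', trim($el->getAttribute('class')), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($class, $tokens, true);
}

// Made-up markup with two tables; the second one carries two classes.
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><table class="Layout"><tr><td>x</td></tr></table>'
    . '<table class="News wide"><tr><td>Latest Distributions</td></tr></table></body></html>'
);

$text = null;
foreach ($doc->getElementsByTagName('table') as $table) {
    if (hasClass($table, 'News')) {
        $text = trim($table->textContent);
        break;
    }
}
echo $text, PHP_EOL;
```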

print_r for nodeList is not working

I have the following source code:
<?php
function getTerms()
{
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('https://charitablebookings.com/terms'); // loads your HTML
$xpath = new DOMXPath($doc);
// returns a list of all links with rel=nofollow
$nodeList = $xpath->query("//div[@class='terms-conditions']");
$temp_dom = new DOMDocument();
$node = $nodeList->item(0);
$temp_dom = new DOMDocument();
foreach($nodeList as $n) $temp_dom->appendChild($temp_dom->importNode($n,true));
print_r($temp_dom->saveHTML());
}
getTerms();
?>
With it I'm trying to get text from a web page by selecting a specific class. Nothing shows up in my browser when I print_r the $temp_dom, and $node is null. What am I doing wrong?
Thanks for your time
The first issue is that DOMDocument's loadHTML method expects HTML content as its first parameter, not a URL.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$html = file_get_contents('https://charitablebookings.com/terms');
$doc->loadHTML($html);
The second problem is with your XPath expression, $xpath->query("//div[@class='terms-conditions']"): there is no div with the class terms-conditions in the document (it probably gets added by some JavaScript loader).
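Both points can be checked offline with a short sketch. The markup in the second half is made up; it only stands in for a page that really contains the div.

```php
<?php
libxml_use_internal_errors(true);   // keep parser warnings quiet

// 1) Passing a URL to loadHTML() parses the URL text itself as markup,
//    so the query cannot find any div.
$doc = new DOMDocument();
$doc->loadHTML('https://charitablebookings.com/terms');
$xpath = new DOMXPath($doc);
$fromUrlString = $xpath->query("//div[@class='terms-conditions']")->length;

// 2) With real markup (made up here), the same query finds the element.
$doc2 = new DOMDocument();
$doc2->loadHTML('<html><body><div class="terms-conditions">Terms text</div></body></html>');
$xpath2 = new DOMXPath($doc2);
$nodes = $xpath2->query("//div[@class='terms-conditions']");
$fromMarkup = $nodes->length;

// Check the length before using item(0), which is null for an empty list.
$text = $nodes->length > 0 ? $nodes->item(0)->textContent : null;

var_dump($fromUrlString, $fromMarkup, $text);
```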

PHP nodeValue strips html tags - innerHTML alternative?

I'm using the following script for a lightweight DOM editor. However, nodeValue in my for loop is converting my html tags to plain text. What is a PHP alternative to nodeValue that would maintain my innerHTML?
$page = $_POST['page'];
$json = $_POST['json'];
$doc = new DOMDocument();
$doc->loadHTMLFile($page); // call it on the instance; the static call fails on PHP 8
$xpath = new DOMXPath($doc);
$entries = $xpath->query('//*[@class="editable"]');
$edits = json_decode($json, true);
$num_edits = count($edits);
for($i=0; $i<$num_edits; $i++)
{
$entries->item($i)->nodeValue = $edits[$i]; // nodeValue strips html tags
}
$doc->saveHTMLFile($page);
Since $edits[$i] is a string, you need to parse it into a DOM structure and replace the original content with the new structure.
Update
The code fragment below works well with HTML that is not XML-compliant (e.g. HTML 4/5).
for($i=0; $i<$num_edits; $i++)
{
$f = new DOMDocument();
$edit = mb_convert_encoding($edits[$i], 'HTML-ENTITIES', "UTF-8");
$f->loadHTML($edit);
$node = $f->documentElement->firstChild;
$entries->item($i)->nodeValue = "";
foreach($node->childNodes as $child) {
$entries->item($i)->appendChild($doc->importNode($child, true));
}
}
I haven't worked with that library in PHP before, but from my other XPath experience I think that nodeValue on anything other than a text node does strip tags. If you're unsure about what's underneath that node, then I think you'll need to recursively descend $entries->item($i)->childNodes if you need to get the markup back.
Or... you may want textContent instead of nodeValue:
http://us.php.net/manual/en/class.domnode.php#domnode.props.textcontent
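For reading the markup back, one common approach (not mentioned in the answers above, so treat it as a sketch) is to serialize each child with DOMDocument::saveHTML(), which accepts a node argument. The helper name innerHTML and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: serializes the children of a node, similar to innerHTML in the browser.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that node and its subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Made-up markup with one editable element.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div class="editable">Hello <b>world</b></div></body></html>');
$xpath = new DOMXPath($doc);
$div = $xpath->query('//*[@class="editable"]')->item(0);

$inner = innerHTML($div);
echo $inner, PHP_EOL;          // markup is kept
echo $div->nodeValue, PHP_EOL; // tags are stripped
```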
