How would I parse this HTML in PHP?

I've exported my Firefox bookmarks as HTML so I can download my extensive music collection onto my phone; my problem is that there is no easy way to do this that I know of.
My intention is to use PHP to parse the HTML into an array of the URLs.
Here's what the HTML looks like:
<DT><A HREF="https://www.youtube.com/watch?v=Ue8PpA557Bc">Don Diablo - Knight Time (Official Music Video) - YouTube</A>
How would I do this?

If you put a valid HTML string in $html, you can parse it with DOMDocument and select the href attributes with XPath.
<?php
$html = '<DT><A HREF="https://www.youtube.com/watch?v=Ue8PpA557Bc">Don Diablo - Knight Time (Official Music Video) - YouTube</A>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query("//a/@href");
$links_array = [];
foreach ($nodeList as $node) {
    $links_array[] = $node->nodeValue;
}
echo "<pre>";
print_r($links_array);
echo "</pre>";
The output here is:
Array
(
    [0] => https://www.youtube.com/watch?v=Ue8PpA557Bc
)

Alternatively, you can skip XPath and iterate over the <a> elements directly:
$doc = new DOMDocument();
$doc->loadHTML($bookmarks);
$urls = [];
foreach ($doc->getElementsByTagName("a") as $node) {
    $urls[] = $node->getAttribute("href");
}
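One caveat, assuming $bookmarks holds the raw Firefox export (which is not strictly valid HTML): you may want to suppress the parser warnings that loadHTML() would otherwise emit:
libxml_use_internal_errors(true); // the bookmark export is not strict HTML; silence parser warnings
$doc->loadHTML($bookmarks);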

Related

Get current song from Icecast using XPath

So I'm trying to get the "Current Song" and echo it from a PHP file using XPath.
Echoing the result of file_get_contents returns the webpage I'm trying to get the content from; however, it seems that I can't filter the page content using XPath. It should echo only the Current Song.
This is what I've tried:
<?php
libxml_use_internal_errors(false);
$html = file_get_contents('http://185.40.20.83/radio/8000/');
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$node = $xpath->query('/html/body/div/div[1]/div[2]/table/tbody/tr[10]/td[2]')->item(0);
echo $node->textContent;
?>
I've been trying this for over an hour and I'm losing hope because I can't see what the problem is...
Absolute paths copied from browser dev tools often fail because they include <tbody> elements that the browser inserts but that are not present in the source HTML. Matching the row by its label text is more robust; try changing your $node to:
$node = $xpath->query('//table//tr[./td[text()="Current Song:"]]/td[2]')->item(0);
or
$node = $xpath->query('//table//tr[./td[text()="Current Song:"]]/td[2]');
echo $node[0]->nodeValue;
Output:
Chmst - Pump Up The Jam
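Putting it together, a minimal sketch of the full script with the revised expression (assuming the status page URL from the question):
<?php
libxml_use_internal_errors(true); // the status page is unlikely to be strict HTML
$html = file_get_contents('http://185.40.20.83/radio/8000/');
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$node = $xpath->query('//table//tr[./td[text()="Current Song:"]]/td[2]')->item(0);
echo $node ? $node->textContent : 'Current Song not found';
?>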

php - loadHTML() - every <p> until a certain class

I'm calling some Wikipedia content in two different ways:
$html = file_get_contents('https://en.wikipedia.org/wiki/Sans-serif');
The first one gets the first paragraph:
$dom = new DomDocument();
@$dom->loadHTML($html);
$p = $dom->getElementsByTagName('p')->item(0)->nodeValue;
echo $p;
The second one gets the first paragraph after a specific $id:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$p = $dom->getElementById($id)->getElementsByTagName('p')->item(0);
echo $p->nodeValue;
I'm looking for a third way to get the whole first part.
So I was thinking of grabbing all the <p> elements before the id/class "toc", which is the id/class of the table of contents.
Any idea how to do that?
If you're just looking for the intro in plain text, you can simply use Wikipedia's API:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Sans-serif
If you want HTML formatting as well (excluding inner images and the like):
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&titles=Sans-serif
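For completeness, a minimal sketch of consuming the plain-text endpoint from PHP; the response keeps the result under query.pages, keyed by page id:
<?php
$url = 'https://en.wikipedia.org/w/api.php?format=json&action=query'
     . '&prop=extracts&exintro=&explaintext=&titles=Sans-serif';
$data = json_decode(file_get_contents($url), true);
$page = reset($data['query']['pages']); // the pages array is keyed by page id
echo $page['extract'];                  // plain-text intro of the article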
You could use DOMDocument and DOMXPath with, for example, an XPath expression like:
//div[@id="toc"]/preceding-sibling::p
$doc = new DOMDocument();
@$doc->loadHTMLFile("https://en.wikipedia.org/wiki/Sans-serif");
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="toc"]/preceding-sibling::p');
foreach ($nodes as $node) {
    echo $node->nodeValue;
}
That would give you the content of the paragraphs preceding the div with id = toc.

XPath Substring-After Help / Query/Evaluate?

I'm building a PHP script to transfer selected contents of an XML file to an SQL database.
One of the hardcoded XML elements is formatted like this:
<visualURL>
id=18144083|img=http://upload.wikimedia.org/wikipedia/en/8/86/Holyrollernovacaine.jpg
</visualURL>
And I'm looking for a way to get just the URL (all the text after img=).
$Image = $xpath->query("substring-after(/Playlist/PlaylistEntry[1]/visualURL[1]/text(), 'img=')", $element)->item(0)->nodeValue;
This produces a "Trying to get property of non-object" error in my PHP output.
There must be another way to extract just the URL contents I want using XPath, no?
Any help would be greatly appreciated!
EDIT:
Here is the minimal code:
<?php
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML('<Playlist>
<PlaylistEntry>
<visualURL>
id=12582194|img=http://upload.wikimedia.org/wikipedia/en/9/96/Sometime_around_midnight.jpg
</visualURL>
</PlaylistEntry>
</Playlist>');
$xpath = new DOMXpath($xmlDoc);
$elements = $xpath->query("/Playlist/PlaylistEntry[1]");
if (!is_null($elements))
    foreach ($elements as $element)
        $Image = $xpath->query("substring-after(/Playlist/PlaylistEntry[1]/visualURL[1]/text(), 'img=')", $element)->item(0)->nodeValue;
print "Finished Item: $Image";
?>
EDIT 2:
After some research I believe I must use
$xpath->evaluate
instead of my current use of
$xpath->query
See this link:
Same XPath query is working with Google docs but not PHP
I'm not exactly sure how to do this yet, but I will investigate more in the morning. Again, any help would be appreciated.
You're on the right track. Use DOMXPath::evaluate() for an XPath expression that doesn't return nodes; substring-after() returns a string (as documented on the linked page). The following code prints the expected output:
$xmlDoc = new DOMDocument();
$xml = <<<XML
<Playlist>
<PlaylistEntry>
<visualURL>
id=12582194|img=http://upload.wikimedia.org/wikipedia/en/9/96/Sometime_around_midnight.jpg
</visualURL>
</PlaylistEntry>
</Playlist>
XML;
$xmlDoc->loadXML($xml);
$xpath = new DOMXpath($xmlDoc);
$elements = $xpath->query("/Playlist/PlaylistEntry");
foreach ($elements as $element) {
    $Image = $xpath->evaluate("substring-after(visualURL, 'img=')", $element);
    print "Finished Item: $Image <br>";
}
Output:
Finished Item: http://upload.wikimedia.org/wikipedia/en/9/96/Sometime_around_midnight.jpg
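One refinement worth noting: the text inside <visualURL> includes surrounding newlines, so the extracted string can carry trailing whitespace; wrapping the node in normalize-space() trims it:
$Image = $xpath->evaluate("substring-after(normalize-space(visualURL), 'img=')", $element);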

XPath: extract complete HTML

I am trying to extract a complete table, including the HTML tags, with XPath, so that I can store it in a variable, do a bit of string replacement on it, and then echo it directly to the screen. I have found numerous posts on getting the text out of a table, but I want to retain the HTML formatting since I am just going to display it (after minor modification).
At present I am extracting the table using string functions like stristr and substr, but I would prefer to use XPath.
I can display the contents of the table with the following, but it just displays the td fields' text with no formatting. It also does not store anything in a variable that I can manipulate.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$arr = $xpath->query('//table');
foreach ($arr as $el) {
    echo $el->textContent;
}
I tried this but got no output:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$arr = $xpath->query('//table');
echo $arr->saveHTML();
Use DOMNode::C14N():
foreach ($arr as $el) {
    echo $el->C14N();
}
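Note that C14N() serializes the node as canonical XML, which can subtly alter the markup (attribute order, empty-tag expansion). If you want the table exactly as HTML, DOMDocument::saveHTML() also accepts a node argument and returns a string you can modify before echoing; a minimal sketch (the str_replace arguments are placeholders):
foreach ($arr as $el) {
    $tableHtml = $dom->saveHTML($el); // markup of just this <table>
    $tableHtml = str_replace('class="old"', 'class="new"', $tableHtml); // hypothetical replacement
    echo $tableHtml;
}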

How to scrape links from a page with DOM & XPath?

I have a page scraped with cURL and am looking to grab all of the links with a certain id. As far as I can tell, the best way to do this is with DOM and XPath. The below code grabs a large number of the URLs, but cuts many of them off and grabs text that is not a URL.
$curl_scraped_page is the page scraped with curl.
$dom = new DOMDocument();
@$dom->loadHTML($curl_scraped_page);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
Am I on the right track? Do I just need to tweak the "/html/body//a" XPath expression, or do I need to add more to capture the id attribute?
You can also do it this way, and you'll get only the <a> tags that have both an id and an href:
$doc = new DOMDocument();
$doc->loadHTML($curl_scraped_page);
$xpath = new DOMXPath($doc);
$hrefs = $xpath->query('//a[@href][@id]');
A plain DOM alternative without XPath:
$dom = new DOMDocument();
$dom->loadHTML($curl_scraped_page);
$links = $dom->getElementsByTagName('a');
$processed_links = array();
foreach ($links as $link) {
    if ($link->hasAttribute('id') && $link->hasAttribute('href')) {
        $processed_links[$link->getAttribute('id')] = $link->getAttribute('href');
    }
}
Here is a solution using the Simple HTML DOM parser (http://simplehtmldom.sourceforge.net/):
include('simple_html_dom.php');
$html = file_get_html('http://www.google.com/');
foreach($html->find('#www-core-css') as $e) echo $e->outertext . '<br>';
I think the easiest way is to combine the two following libraries to pull information from another website:
Pull info from any HTML tag, contents, or tag attribute: http://simplehtmldom.sourceforge.net/
Easy-to-use cURL wrapper that supports POST requests: https://github.com/php-curl-class/php-curl-class
Example:
include('path/to/curl.php');
include('path/to/simple_html_dom.php');
$url = 'http://www.example.com';
$curl = new Curl;
$html = str_get_html($curl->get($url)); // full HTML of the website
$linksWithSpecificID = $html->find('a[id=foo]'); // returns an array of elements
Check the Simple HTML DOM Parser manual (first link above) for details on manipulating the HTML data.
