Parsing HTML Table Data from XML with PHP

I am somewhat new to PHP and can't work out what I am doing wrong here.
Problem: I am trying to get the href of a specific HTML element from a string of HTML stored inside an XML element in Reddit's feed. If you visit the page, it is the actual link to the video (the external YouTube link or similar, not the Reddit link) and nothing else.
Here is my code so far (code updated):
Update: Loop-mania! I now get all of the hrefs, but I am trying to store them in a global array so I can access a random one outside of this function.
function getXMLFeed() {
echo "<h2>Reddit Items</h2><hr><br><br>";
//$feedURL = file_get_contents('https://www.reddit.com/r/videos/.xml?limit=200');
$feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
$xml = simplexml_load_file($feedURL);
//define each xml entry from reddit as an item
foreach ($xml -> entry as $item ) {
foreach ($item -> content as $content) {
$newContent = (string)$content;
$html = str_get_html($newContent);
foreach($html->find('table') as $table) {
$links = $table->find('span', '0');
//echo $links;
foreach($links->find('a') as $link) {
echo $link->href;
}
}
}
}
}
XML Code:
http://pasted.co/0bcf49e8
I've also included JSON if it can be done this way; I just preferred XML:
http://pasted.co/f02180db
That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).
foreach ($item -> content as $content) {
$dom = new DOMDocument();
$dom -> loadHTML($content);
$xpath = new DOMXPath($dom);
$classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";
foreach ($dom->getElementsByTagName('table') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
//$originalURL = $node->getAttribute('href');
}
//$html = $dom->saveHTML();
}
I can parse the table fine, but when it comes to getting a specific element's value (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.
Can anyone point me in the right direction? Let me know if there is anything else I should add here. Thanks!
Added HTML:
I am specifically trying to extract <span>[link]</span> from each table/item.
http://pastebin.com/QXa2i6qz

The following code extracts the YouTube link from each entry's content.
function extract_youtube_link($xml) {
$entries = $xml['entry'];
$videos = [];
foreach($entries as $entry) {
$content = html_entity_decode($entry['content']);
preg_match_all('/<span><a href="([^"]*)">\[link\]/', $content, $matches);
if(!empty($matches[1][0])) {
$videos[] = array(
'entry_title' => $entry['title'],
'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
'author_reddit_url' => $entry['author']['uri'],
'video_url' => $matches[1][0]
);
}
}
return $videos;
}
$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);
foreach($videos as $video) {
echo "<p>Entry Title: {$video['entry_title']}</p>";
echo "<p>Author: {$video['author']}</p>";
echo "<p>Author URL: {$video['author_reddit_url']}</p>";
echo "<p>Video URL: {$video['video_url']}</p>";
echo "<br><br>";
}
The function returns a multidimensional array in which each element has the keys entry_title, author, author_reddit_url and video_url. Hope it helps!
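Since the question's update asks for a random link outside the function, here is a minimal sketch of one way to do that without a global array: return the list and pick from it. The sample data and the helper name pickRandomVideo are made up for illustration; only the array shape comes from the answer above.

```php
<?php
// Hypothetical data in the same shape that extract_youtube_link() returns.
$videos = [
    ['entry_title' => 'First',  'video_url' => 'https://example.com/1'],
    ['entry_title' => 'Second', 'video_url' => 'https://example.com/2'],
    ['entry_title' => 'Third',  'video_url' => 'https://example.com/3'],
];

// Illustrative helper (not part of the original answer):
// returns one random entry, or null when the list is empty.
function pickRandomVideo(array $videos): ?array
{
    if ($videos === []) {
        return null;
    }
    return $videos[array_rand($videos)];
}

$random = pickRandomVideo($videos);
echo $random['video_url'], PHP_EOL;
```

Returning the array and passing it around keeps the function free of global state, which tends to be easier to debug than writing into a global from inside the loop.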

If you're looking for a specific element, you don't need to parse the whole thing. One way of doing it is to use the DOMXPath class and query the document directly. The documentation should guide you through it:
http://php.net/manual/es/class.domxpath.php
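A minimal sketch of that idea, assuming each entry's content is an HTML fragment shaped like the sample below. The sample markup and the function name extractLinkHref are illustrative assumptions, not taken from the actual feed.

```php
<?php
// Hypothetical sample imitating one <content> payload from the feed.
$contentHtml = '<table><tr>'
    . '<td><a href="https://www.reddit.com/r/videos/x"><img src="t.jpg"/></a></td>'
    . '<td>submitted by <a href="https://www.reddit.com/user/someone">/u/someone</a><br/>'
    . '<span><a href="https://www.youtube.com/watch?v=abc123">[link]</a></span> '
    . '<span><a href="https://www.reddit.com/r/videos/comments/x">[comments]</a></span>'
    . '</td></tr></table>';

// Illustrative helper: returns the href of the anchor labelled "[link]", or null.
function extractLinkHref(string $html): ?string
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match on the anchor text rather than on position, since nothing has an id or class.
    $nodes = $xpath->query('//span/a[normalize-space(.) = "[link]"]/@href');

    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}

echo extractLinkHref($contentHtml), PHP_EOL;
```

Selecting by the anchor's text avoids brittle positional paths such as /html/body/table[1]/tbody/tr/td[2]/span[1]/a, which break if the parser does not insert a tbody.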

Related

Get a web page XML code using php and use XPATH on it

Maybe this question has been answered before, but I'm very new to web development.
I'm trying to get the full XML text from this page:
Human Genome
I need to run some XPath queries against that XML, such as "get the ID" and others.
For example:
//eSearchResult/IdList/Id/node()
How can I load the full XML into a PHP object so that I can request data through XPath queries?
I used this code before:
<?php
$text = $_REQUEST['text'];
$xmlId = simplexml_load_file('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term='.$text.'%5bGene%20Name%5d+AND+%22Homo%20sapiens%22%5bOrganism');
$id = $xmlId->IdList[0]->Id;
$xmlGeneralData = simplexml_load_file('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id='.$id.'&retmode=xml');
$geneName = $xmlGeneralData->DocumentSummarySet->DocumentSummary[0]->Name;
$geneDesc = $xmlGeneralData->DocumentSummarySet->DocumentSummary[0]->Description;
$geneChromosome = $xmlGeneralData->DocumentSummarySet->DocumentSummary[0]->Chromosome;
echo "Id: ".$id."\n";
echo "Name: ".$geneName."\n";
echo "Description: ".$geneDesc."\n";
echo "Chromosome: ".$geneChromosome."\n";?>
But according to the professor, this code doesn't use XPath queries, and the page is required to use them.
Can someone help me or explain how to do it?
Here is the code converted to use XPath queries.
<?php
$text = $_REQUEST['text'];
$xmlId = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term='.$text.'%5bGene%20Name%5d+AND+%22Homo%20sapiens%22%5bOrganism';
//Load XML and define Xpath
$xml_id = new DOMDocument();
$xml_id->load($xmlId);
$xpath = new DOMXPath($xml_id);
//Xpath query to get ID
$elements = $xpath->query("//eSearchResult/IdList/Id");
//Loop through result of xpath query and store in array of ID
if ($elements->length >0) {
foreach ($elements as $entry) {
$id[] = $entry->nodeValue;
}
}
echo "Id: ".$id[0]."\n";
//Output the first string of ID array from xpath result set
$xmlGeneralData = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id='.$id[0].'&retmode=xml';
//Load XML and define Xpath
$xml_gd = new DOMDocument();
$xml_gd->load($xmlGeneralData);
$xpath = new DOMXPath($xml_gd);
//Xpath query to search for Document Summary with first string of ID array from previous result set
$elements = $xpath->query("//eSummaryResult/DocumentSummarySet/DocumentSummary[@uid='".$id[0]."']");
//Loop through result of xpath query and find nodes and print out the result
if ($elements->length >0) {
foreach ($elements as $entry) {
echo "Name: ".$entry->getElementsByTagName('Name')->item(0)->nodeValue."\n";
echo "Description: ".$entry->getElementsByTagName('Description')->item(0)->nodeValue."\n";
echo "Chromosome: ".$entry->getElementsByTagName('Chromosome')->item(0)->nodeValue."\n";
}
}
?>

DomNode get value of item

Hello, I'm new to DOMNode and I'm trying to check the values in an XML tree, which loads fine.
Here is my code, but I don't understand why it is not working.
private function createCSV($xml, $f)
{
foreach ($xml->getElementsByTagName('*') as $item)
{
$hasChild = $item->hasChildNodes() ? true : false;
if(!$hasChild)
{
//echo 'Doesn\'t have children';
echo 'Value: ' . $item->nodeValue;
}
else
{
//echo 'Has children';
$this->createCSV($item, $f);
}
}
}
$item->nodeValue doesn't print anything to the browser.
I read the documentation but I can't see any mistake.
PS. $item->tagName doesn't work either.
UPDATE
When using this: echo $item->ownerDocument->saveHTML($item);
I get the tags listed, but I don't get the data inside (between the tags), like innerHTML in JavaScript.
UPDATE
sample xml data : http://pastebin.com/dkuUUC0Q
Text nodes are also considered child nodes, but you're only iterating element nodes (getElementsByTagName). Because of this, any element that contains text reports that it has children, so you almost never reach the branch that prints the value.
Try this:
if(!$xml->hasChildNodes()){
printf('Value: %s', $xml->nodeValue);
return;
}
foreach($xml->childNodes as $item)
$this->createCSV($item, $f);
XPath version:
$xpath = new DOMXPath($xml);
$text = $xpath->query('//text()[normalize-space()]');
foreach($text as $node)
printf('Value: %s', $node->nodeValue);
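To see why the original check fails, here is a small self-contained sketch. The sample XML and the function name collectLeafValues are made up for illustration; the recursion mirrors the childNodes approach suggested above.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<root><name>Widget</name><price>9.99</price><empty/></root>');

// <name> contains only text, yet it still reports a child: the text node itself.
$name = $doc->getElementsByTagName('name')->item(0);
$nameHasChildren = $name->hasChildNodes();
$firstChildIsText = $name->firstChild->nodeType === XML_TEXT_NODE;

// Illustrative helper: walk childNodes (which includes text nodes)
// and collect the non-empty values found at the leaves.
function collectLeafValues(DOMNode $node, array &$values): void
{
    if (!$node->hasChildNodes()) {
        $text = trim((string) $node->nodeValue);
        if ($text !== '') {
            $values[] = $text;
        }
        return;
    }
    foreach ($node->childNodes as $child) {
        collectLeafValues($child, $values);
    }
}

$values = [];
collectLeafValues($doc->documentElement, $values);
print_r($values);
```

The values come from the text nodes at the bottom of the tree, which getElementsByTagName('*') never returns.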

Remove HTML element from parsed HTML document on a condition

I've parsed a HTML document using Simple PHP HTML DOM Parser. In the parsed document there's a ul-tag with some li-tags in it. One of these li-tags contains one of those dreaded "Add This" buttons which I want to remove.
To make this worse, the list item has no class or id, and it is not always in the same position in the list. So there is no easy way (correct me if I'm wrong) to remove it with the parser.
What I want to do is to search for the string 'addthis.com' in all li-elements and remove any element that contains that string.
<ul>
<li>Foobar</li>
<li>addthis.com</li><!-- How do I remove this? -->
<li>Foobar</li>
</ul>
FYI: This is purely a hobby project in my quest to learn PHP and not a case of content theft for profit.
All suggestions are welcome!
I couldn't find a method to remove nodes explicitly, but you can remove one by setting its outertext to an empty string.
$html = new simple_html_dom();
$html->load(file_get_contents("test.html"), false, false); // preserve formatting
foreach($html->find('ul li') as $element) {
if (count($element->find('a.addthis_button')) > 0) {
$element->outertext="";
}
}
echo $html;
Well what you can do is use jQuery after the parsing. Something like this:
$('li').each(function(i) {
if($(this).html() == "addthis.com"){
$(this).remove();
}
});
This solution uses the DOMDocument class and the DOMNode::removeChild method:
$str="<ul><li>Foobar</li><li>addthis.com</li><li>Foobar</li></ul>";
$remove='addthis.com';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elements = $doc->getElementsByTagName('li');
$domElemsToRemove = array();
foreach ($elements as $element) {
$pos = strpos($element->textContent, $remove); // or similar $element->nodeValue
if ($pos !== false) {
$domElemsToRemove[] = $element;
}
}
foreach( $domElemsToRemove as $domElement ){
$domElement->parentNode->removeChild($domElement);
}
$str = $doc->saveHTML(); // the list is now <ul><li>Foobar</li><li>Foobar</li></ul>, wrapped in the doctype/html/body that loadHTML adds
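The same removal can also be written with an XPath contains() test instead of a manual strpos loop. This is a sketch of an alternative, not part of the answer above; it reuses the sample list from the question.

```php
<?php
$str = '<ul><li>Foobar</li><li>addthis.com</li><li>Foobar</li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

// Select every <li> whose text contains the marker string.
$matches = $xpath->query('//li[contains(., "addthis.com")]');

// Copy the matches into a plain array first, so removing nodes
// cannot interfere with the iteration.
foreach (iterator_to_array($matches) as $li) {
    $li->parentNode->removeChild($li);
}

// Serialise only the <ul>, without the html/body wrapper added by loadHTML.
$ul = $doc->getElementsByTagName('ul')->item(0);
$result = $doc->saveHTML($ul);
echo $result, PHP_EOL;
```

Passing the node to saveHTML() returns just that subtree, which is usually what you want when working with a fragment.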

PHP Simple DOM Parser to Scrape From Multiple URLs

Is it possible to use a foreach loop to scrape multiple URLs from an array? I've been trying, but for some reason it only pulls from the first URL in the array and then shows the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code since the comment part didn't work. :) It's very dirty because I just edited one of the examples to play with it to see if I could get it to do what I wanted.
include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Or you could use the following instead of the whole foreach loop above
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// if the 0 means, return first found or something similar,
// I just had a look at Amazons source code, and it contains
// 2 HTML tags with id='ASIN'. If they were following html-regulations
// then there should only be ONE element with a specific id.
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
This should do the trick.
I have renamed the array to 'links' instead of 'link'. It's an array of links, containing link(s), so foreach($link as $links) seemed wrong, and I changed it to foreach($links as $link).
I really need to ask this question, as the answer would settle many others once people read this thread. What if you used articles, as in the examples on the simple html dom site?
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
what if its $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
what would this area look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
I've seen this multiple-links question all over Stack Overflow for the past 2 years, and I still cannot figure it out. It would be great to get a basic handle on it, along the lines of the simple html dom examples.
Thanks.
This is my first time posting, so I'm sure I broke a bunch of rules and didn't format the code section right. I just badly wanted to ask this question.

ignoring nested elements when parsing xml with php

This is probably a simple question for someone to answer.
xml:
<foobar>
<foo>i am a foo</foo>
<bar>i am a bar</bar>
<foo>i am a <bar>bar</bar></foo>
</foobar>
In the above, I want to display all elements that are <foo>. When the script gets to the line with the nested <bar>, the result is "i am a bar", which isn't the result I had hoped for.
Is it possible to print out the entire contents of that element as it is, so that I see: "i am a <bar>bar</bar>"?
php:
$xml = file_get_contents('sample');
$dom = new DOMDocument;
@$dom->loadHTML($xml);
$resources= $dom->getElementsByTagName('foo');
foreach ($resources as $resource){
echo $resource->nodeValue . "\n";
}
After some searching and trying to do what I needed with SimpleXML, I arrived at the following conclusion. My issue with SimpleXML was where the elements are. If the XML is structured and the hierarchy is standard, I have no problem.
If the XML is a web page, for example, and the <foo> element can be anywhere, SimpleXML doesn't have a good facility like getElementsByTagName to pull out the element wherever it may be.
<?php
$doc = new DOMDocument();
$doc->load('sample');
$element_name = 'foo';
if ($doc->getElementsByTagName($element_name)->length > 0) {
$resources = $doc->getElementsByTagName($element_name);
foreach ($resources as $resource) {
$id = null;
if (!$resource->hasAttribute('id')) {
$resource->setAttribute('id', gen_uuid());
}
$innerHTML = null;
$children = $resource->childNodes;
foreach ($children as $child) {
$tmp_doc = new DOMDocument();
$tmp_doc->appendChild($tmp_doc->importNode($child,true));
$innerHTML .= rtrim($tmp_doc->saveHTML());
}
$resource->nodeValue = $innerHTML;
}
}
echo $doc->saveHTML();
?>
Rather than writing all that code, you might try XPath. That expression would be "//foo", which would get a list of all the elements in the document named "foo".
http://php.net/manual/en/simplexmlelement.xpath.php
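A minimal sketch of the "//foo" query applied to the sample XML from the question. It uses DOMXPath rather than the SimpleXML method linked above, and the helper name innerXml is an illustrative assumption; the point is to serialise each child node instead of reading nodeValue, which flattens nested markup.

```php
<?php
$xml = '<foobar>'
    . '<foo>i am a foo</foo>'
    . '<bar>i am a bar</bar>'
    . '<foo>i am a <bar>bar</bar></foo>'
    . '</foobar>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Illustrative helper: concatenate the serialised children of a node,
// so nested tags survive instead of being flattened to text.
function innerXml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$results = [];
foreach ($xpath->query('//foo') as $foo) {
    $results[] = innerXml($foo);
}
print_r($results);
```

Because the sample is well-formed XML, it is loaded with loadXML here; loadHTML would also wrap it in html and body elements.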
