file_get_html multiple times in the same script - php

I am using Simple HTML DOM Parser. When I request a page with file_get_html(), I get two values: a title and a URL. For each of those URLs I want to call file_get_html() again.
But the second request seems to return the same data as the first one.
Like in this script:
foreach($urls as $value) {
$html=file_get_html($value);
foreach($html->find('div[class=data] a') as $content) {
$url2='http://abc.com/'.$content->href;
$childHtml=file_get_html($url2);
echo $childHtml; // Problem is here: I am getting the previous page's HTML
}
}
What am I doing wrong here?
This is the main crawling code
$urls=GenerateURLS($currentmonth);
$tracker=0;
$urlHolderArray=array();
foreach ($urls as $value) {
$html=file_get_html($value); //Here I am requesting the html dom
foreach ($html->find('div[id=centrepanel] div[class=events_listing_container] div[class=events_info_container] div[class=events_image] a') as $content) {
$proxyURL="http://www.junkclub.com/".$content->href;
array_push($urlHolderArray,$proxyURL);
}
}
echo '<pre>';
print_r($urlHolderArray);
echo '</pre>';
foreach ($urlHolderArray as $link) {
$htmlCon=file_get_html($link);
}
echo $htmlCon;
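One thing worth checking in the last loop above: $htmlCon is reassigned on every iteration and only echoed after the loop has finished, so only the final page would be printed. A minimal sketch of that effect follows. It assumes a hypothetical fetchPage() stub in place of file_get_html() and made-up example.com URLs, so that it runs without network access or the Simple HTML DOM library:

```php
<?php
// Hypothetical stand-in for file_get_html(); returns a fake page body per URL.
function fetchPage(string $url): string
{
    return "<html><body>Content of {$url}</body></html>";
}

// Made-up URLs for illustration only.
$urlHolderArray = ['http://example.com/a', 'http://example.com/b'];

// Overwriting a single variable keeps only the last page.
foreach ($urlHolderArray as $link) {
    $htmlCon = fetchPage($link);
}

// Collecting into an array (or echoing inside the loop) keeps every page.
$pages = [];
foreach ($urlHolderArray as $link) {
    $pages[$link] = fetchPage($link);
}

foreach ($pages as $link => $page) {
    echo $link, ' => ', strlen($page), " bytes\n";
}
```

Whether this explains the behaviour in the first script as well depends on what the remote pages actually return, so treat it as one thing to rule out rather than a diagnosis.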

Related

Parsing HTML Table Data from XML with PHP

I am somewhat new to PHP and can't wrap my head around what I am doing wrong here.
Problem: I am trying to get the href of a certain HTML element that sits inside a string of characters within an XML element from Reddit (if you visit this page, it would be the actual link of the video: not the Reddit link but the external YouTube link or whatever, and nothing else).
Here is my code so far (code updated):
Update: Loop-mania! Got all of the hrefs, but am now trying to store them inside a global array to access a random one outside of this function.
function getXMLFeed() {
echo "<h2>Reddit Items</h2><hr><br><br>";
//$feedURL = file_get_contents('https://www.reddit.com/r/videos/.xml?limit=200');
$feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
$xml = simplexml_load_file($feedURL);
//define each xml entry from reddit as an item
foreach ($xml -> entry as $item ) {
foreach ($item -> content as $content) {
$newContent = (string)$content;
$html = str_get_html($newContent);
foreach($html->find('table') as $table) {
$links = $table->find('span', '0');
//echo $links;
foreach($links->find('a') as $link) {
echo $link->href;
}
}
}
}
}
XML Code:
http://pasted.co/0bcf49e8
I've also included JSON if it can be done this way; I just preferred XML:
http://pasted.co/f02180db
That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).
foreach ($item -> content as $content) {
$dom = new DOMDocument();
$dom -> loadHTML($content);
$xpath = new DOMXPath($dom);
$classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";
foreach ($dom->getElementsByTagName('table') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
//$originalURL = $node->getAttribute('href');
}
//$html = $dom->saveHTML();
}
I can parse the table fine, but when it comes to getting certain element's values (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.
Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!
Added HTML:
I am specifically trying to extract <span>[link]</span> from each table/item.
http://pastebin.com/QXa2i6qz
The following code extracts all the YouTube links from the content of each entry.
function extract_youtube_link($xml) {
$entries = $xml['entry'];
$videos = [];
foreach($entries as $entry) {
$content = html_entity_decode($entry['content']);
// [^"]* stops at the closing quote; a greedy .* could run on to a later "[link]" match
preg_match_all('/<span><a href="([^"]*)">\[link\]/', $content, $matches);
if(!empty($matches[1][0])) {
$videos[] = array(
'entry_title' => $entry['title'],
'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
'author_reddit_url' => $entry['author']['uri'],
'video_url' => $matches[1][0]
);
}
}
return $videos;
}
$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);
foreach($videos as $video) {
echo "<p>Entry Title: {$video['entry_title']}</p>";
echo "<p>Author: {$video['author']}</p>";
echo "<p>Author URL: {$video['author_reddit_url']}</p>";
echo "<p>Video URL: {$video['video_url']}</p>";
echo "<br><br>";
}
The function returns a multidimensional array in which each element contains entry_title, author, author_reddit_url and video_url. Hope it helps you!
If you're looking for a specific element, you don't need to parse the whole thing. One way of doing it could be to use the DOMXPath class and query the XML directly. The documentation should guide you through it.
http://php.net/manual/es/class.domxpath.php .
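As a rough illustration of the DOMXPath suggestion, the sketch below pulls the href of the anchor whose text is [link] out of one entry's HTML. The function name extractLinkHref() and the sample markup are assumptions made up for this example; the real Reddit feed may be laid out differently, so the XPath expression may need adjusting:

```php
<?php
// Hypothetical helper: returns the href of the "[link]" anchor, or null if none.
function extractLinkHref(string $contentHtml): ?string
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($contentHtml);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Select the href attribute of an <a> inside a <span> whose text is "[link]".
    $nodes = $xpath->query('//span/a[normalize-space(.)="[link]"]/@href');

    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}

// Invented sample of what one decoded <content> element might contain.
$sample = '<table><tr>'
    . '<td><a href="https://www.reddit.com/r/videos/comments/x/">thumb</a></td>'
    . '<td>submitted by <a href="https://www.reddit.com/user/someone">/u/someone</a><br/>'
    . '<span><a href="https://www.youtube.com/watch?v=abc123">[link]</a></span> '
    . '<span><a href="https://www.reddit.com/r/videos/comments/x/">[comments]</a></span>'
    . '</td></tr></table>';

echo extractLinkHref($sample), "\n";
```

Matching on the anchor text avoids relying on position, which helps when nothing in the markup has an id or class.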

how to get href from within element using php and simple html dom

I have an html page that looks a bit like this
xxxx
google!
<div class="big-div">
<a href="http://www.url.com/123" title="123">
<div class="little-div">xxx</div></a>
<a href="http://www.url.com/456" title="456">
<div class="little-div">xxx</div></a>
</div>
xxxx
I am trying to pull all of the hrefs out of the big-div. I can get all the hrefs out of the whole page by using code like the one below.
$links = $html->find ('a');
foreach ($links as $link)
{
echo $link->href.'<br>';
}
But how do I get only the hrefs within the div "big-div"?
Edit:
I think I got it. For those that care:
foreach ($html->find('div[class=big-div]') as $element) {
$links = $element->find('a');
foreach ($links as $link) {
echo $link->href.'<br>';
}
}
The documentation is useful:
$html->find(".big-div")->find('a');
And then proceed to get the href and whatever other attributes you are interested in.
Edit: The above would be the general idea. I've never used Simple HTML DOM, so perhaps you need to tweak the syntax somewhat. Try:
foreach($html->find('.big-div') as $bigDiv) {
// find() without an index returns an array, so loop over the links
foreach($bigDiv->find('a') as $link) {
echo $link->href . '<br>';
}
}
or perhaps:
$bigDivs = $html->find('.big-div');
foreach($bigDivs as $div) {
// find() without an index returns an array, so loop over the links
foreach($div->find('a') as $link) {
echo $link->href . '<br>';
}
}
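For comparison, here is a self-contained sketch of the same idea using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM, so it can be run without the library. The helper name hrefsInsideClass() is made up for this example, and the sample page is a cleaned-up guess at the markup from the question:

```php
<?php
// Hypothetical helper: hrefs of all <a> elements inside any <div> with the given class.
function hrefsInsideClass(string $html, string $class): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Match the class as a whole word within the class attribute.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]//a/@href',
        $class
    );

    $hrefs = [];
    foreach ($xpath->query($query) as $attr) {
        $hrefs[] = $attr->nodeValue;
    }
    return $hrefs;
}

$page = <<<HTML
<html><body>
<a href="http://www.google.com">google!</a>
<div class="big-div">
<a href="http://www.url.com/123" title="123"><div class="little-div">xxx</div></a>
<a href="http://www.url.com/456" title="456"><div class="little-div">xxx</div></a>
</div>
</body></html>
HTML;

foreach (hrefsInsideClass($page, 'big-div') as $href) {
    echo $href, "\n";
}
```

The link outside the div is skipped because the query only descends from the matching div.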
Quick flip - put this in your foreach
$image = $html->find('.big-div')->href;

html code appear in the page as source and not executed by browser

My code looks like this:
$link = "<a class=\"openevent\" href=\"$finalUrl\" target=\"_blank\">Open Event</a>";
foreach ($spans as $span) {
if ($span->getAttribute('class') == 'category') {
$span->nodeValue .= $link;
}
}
The problem is that the $link variable shows up on the page as HTML source, like this:
<a class="openevent" href="http://www.mysite.com/Free-Live-Streaming-Video-Online-Other-Cycling-Cycling-The-Tour-of-Britain-170638.html" target="_blank">Open Event</a>
instead of appearing as a normal hyperlink.
What is wrong with my code?
You are adding text to the span's node value. To add an anchor, you have to create an anchor node with createElement, set its attributes, and then append it to the span.
foreach ($spans as $span) {
if ($span->getAttribute('class') == 'category') {
$link = $doc->createElement('a', 'Open Event');
$link->setAttribute("class", "openevent");
$link->setAttribute("href", $finalUrl);
$link->setAttribute("target", "_blank");
$span->appendChild($link);
}
}
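To see the difference side by side, the following sketch applies both approaches to a tiny document and compares the serialized output. The makeSpan() helper, the sample markup and the example.com URL are made up for illustration:

```php
<?php
$finalUrl = 'http://example.com/event.html'; // hypothetical URL for illustration

// Hypothetical helper: builds a small document and returns it with its <span>.
function makeSpan(): array
{
    $doc = new DOMDocument();
    $doc->loadHTML('<html><body><span class="category">Cycling </span></body></html>');
    return [$doc, $doc->getElementsByTagName('span')->item(0)];
}

// Approach from the question: markup appended to nodeValue is treated as text.
[$docA, $spanA] = makeSpan();
$spanA->nodeValue .= '<a href="' . $finalUrl . '">Open Event</a>';
$escaped = $docA->saveHTML($spanA);

// Approach from the answer: a real element node is appended.
[$docB, $spanB] = makeSpan();
$link = $docB->createElement('a', 'Open Event');
$link->setAttribute('href', $finalUrl);
$spanB->appendChild($link);
$real = $docB->saveHTML($spanB);

echo $escaped, "\n";
echo $real, "\n";
```

In the first result the angle brackets come out as entities, which is why the browser shows the tag as text; in the second the anchor is part of the tree and is serialized as markup.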
It looks like you are building some kind of XML in the foreach. When you build XML, HTML characters such as '<' are encoded as entities (&lt;), so what you print is not actual HTML. Maybe the html_entity_decode function will work for you.
echo html_entity_decode($doc->saveHTML());

PHP Simple DOM Parser to Scrape From Multiple URLs

Is it possible to use a foreach loop to scrape multiple URLs from an array? I've been trying, but for some reason it only pulls from the first URL in the array and then shows the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code since the comment part didn't work. :) It's very dirty because I just edited one of the examples to play with it to see if I could get it to do what I wanted.
include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Or you could use the following instead of the whole foreach loop above:
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// The second argument, 0, returns only the first match. That matters here:
// I just had a look at Amazon's source code, and it contains
// 2 HTML tags with id='ASIN'. If they were following the HTML rules,
// there would be only ONE element with a specific id.
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
This should do the trick.
I have renamed the array to 'links' instead of 'link'. It is an array of links, each element being one link, so foreach($link as $links) seemed wrong and I changed it to foreach($links as $link).
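The key point of the rewrite is that the function is declared once, outside the loop, and then called for each URL. Here is a self-contained sketch of that pattern. It swaps in PHP's built-in DOMDocument and DOMXPath for Simple HTML DOM and uses invented inline pages instead of live Amazon requests; the names scrapeProduct() and $pages are assumptions for this example:

```php
<?php
// Invented sample pages keyed by URL; they stand in for live HTTP responses.
$pages = [
    'http://example.com/dp/ONE/' =>
        '<html><body><input id="ASIN" value="ONE">'
        . '<h1 class="parseasinTitle">First item</h1>'
        . '<b class="priceLarge">$10.00</b></body></html>',
    'http://example.com/dp/TWO/' =>
        '<html><body><input id="ASIN" value="TWO">'
        . '<h1 class="parseasinTitle">Second item</h1>'
        . '<b class="priceLarge">$20.00</b></body></html>',
];

// Declared once, outside any loop, so it can be called for every page.
function scrapeProduct(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // string(...) yields the text of the first matching node, or '' if none.
    return [
        'ASIN'   => $xpath->evaluate('string(//input[@id="ASIN"]/@value)'),
        'Name'   => trim($xpath->evaluate('string(//h1[@class="parseasinTitle"])')),
        'Retail' => trim($xpath->evaluate('string(//b[@class="priceLarge"])')),
    ];
}

$results = [];
foreach ($pages as $url => $html) {
    $results[$url] = scrapeProduct($html);
}

foreach ($results as $url => $ret) {
    echo $url, ': ', implode(' | ', $ret), "\n";
}
```

With live pages, the inline strings would be replaced by whatever fetches the HTML for each URL.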
I really need to ask this question, as it will answer many more questions once the world reads this thread. What if ... you used articles, like the examples on the Simple HTML DOM site?
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
what if its $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
what would this area look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
I've seen this multiple-links question all over Stack Overflow for the past 2 years, and I still cannot figure it out. It would be great to get a basic handle on it, in the way the Simple HTML DOM examples do it.
Thanks.
This is my first time posting; I'm sure I broke a bunch of rules and didn't format the code section right. I just badly needed to ask this question.

PHP: $_POST array to XML file and display results

I'm creating a "Madlibs" page where visitors can create funny story things online. The original files are in XML format with the blanks enclosed in XML tags
(Such as blablabla <PluralNoun></PluralNoun> blablabla <Verb></Verb> ).
The form data is created using XSL and the results are saved using a $_POST array. How do I post the $_POST array between the matching XML tags and then display the result to the page? I'm sure it uses a "foreach" statement, but I'm just not familiar enough with PHP to figure out what functions to use. Any help would be great.
Thanks,
E
I'm not sure if I understood your problem quite well, but I think this might help:
// mocking some $_POST variables
$_POST['Verb'] = 'spam';
$_POST['PluralNoun'] = 'eggs';
// original template with blanks (should be loaded from a valid XML file)
$xml = 'blablabla <PluralNoun></PluralNoun> blablabla <Verb></Verb>';
$valid_xml = '<?xml version="1.0"?><xml>' . $xml . '</xml>';
$doc = new DOMDocument();
// loadXML() is an instance method that returns false on failure
if ($doc->loadXML($valid_xml, LIBXML_NOERROR)) {
$text = ''; // used to accumulate output while walking XML tree
foreach ($doc->documentElement->childNodes as $child) {
if ($child->nodeType == XML_TEXT_NODE) { // keep text nodes
$text .= $child->wholeText;
} else if (array_key_exists($child->tagName, $_POST)) {
// replace nodes whose tag matches a POST variable
$text .= $_POST[$child->tagName];
} else { // keep other nodes
$text .= $doc->saveXML($child);
}
}
echo $text . "\n";
} else {
echo "Failed to parse XML\n";
}
Here is the PHP foreach syntax. Hope it helps:
$arr = array('fruit1' => 'apple', 'fruit2' => 'orange');
foreach ($arr as $key => $val) {
echo "$key = $val\n";
}
And here is the code to loop through your $_POST variables:
foreach ($_POST as $key => $val) {
echo "$key = $val\n";
// then you can fill each POST var to your XML
// maybe you want to use PHP str_replace function too
}
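The str_replace idea mentioned in the comment above could be sketched roughly as follows. The function name fillTemplate() and the sample template are made up for this example, and a plain array stands in for $_POST so the snippet runs on its own:

```php
<?php
// Hypothetical helper: replaces each empty <Tag></Tag> pair with the matching answer.
function fillTemplate(string $template, array $answers): string
{
    foreach ($answers as $tag => $value) {
        // Only accept keys that look like simple tag names.
        if (!preg_match('/^[A-Za-z][A-Za-z0-9]*$/', (string) $tag)) {
            continue;
        }
        $blank = "<{$tag}></{$tag}>";
        // Escape the visitor's input before it is placed into the page.
        $template = str_replace($blank, htmlspecialchars((string) $value), $template);
    }
    return $template;
}

$template = 'The <PluralNoun></PluralNoun> like to <Verb></Verb> all day.';
$answers  = ['PluralNoun' => 'eggs', 'Verb' => 'spam']; // stands in for $_POST

echo fillTemplate($template, $answers), "\n";
```

This only handles blanks written exactly as an empty open and close pair; for anything more varied, walking the XML tree as in the earlier answer is the more robust route.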
