simple_html_dom - read html page, two arrays - php

This is my entire code
// include the scraper
include('simple_html_dom.php');
// connect to the page for scraping
$html = file_get_html('http://www.niagarafallsreview.ca/news/local');
// make empty arrays
$headlines = array();
$links = array();
// look for 'h' headings on page
foreach($html->find('h1') as $header) {
$headlines[] = $header->plaintext;
}
// look for 'a' links that start with 'http://www.niagarafallsreview.ca/2016/04/'
foreach($html->find('a[href^="http://www.niagarafallsreview.ca/2016/04/"]') as $link) {
$links[] = $link->href;
}
// trim the headlines because one on top and bottom were not needed
$output = array_slice($headlines, 1, -1);
// for each header output a nice list of the headers
foreach ($output as $headers){
echo "<a href='#'>$headers</a>" . "<br />";
}
// make sure the links are unique and no doubles are found
$result = array_unique($links);
// for each link output it in a nice list
foreach ($result as $linkk){
echo "<a href='$linkk'>$linkk</a>" . "<br />";
}
This code produces the headings in a nice list, and also a nice list of the links.
My problem is that I need to combine them: I would like $headers to be the link text and $linkk to be the href, like this:
<a href='$linkk'>$headers</a>
I don't know how to do this because I have two foreach statements. I tried to combine them but was unsuccessful.
Any help will be greatly appreciated.
Thanks.

Try this:
// include the scraper
include('simple_html_dom.php');
// connect to the page for scraping
$html = file_get_html('http://www.niagarafallsreview.ca/news/local');
// make empty arrays
$headlines = array();
$links = array();
// look for 'h' headings on page
foreach($html->find('h1') as $header) {
$headlines[] = $header->plaintext;
}
// look for 'a' links that start with 'http://www.niagarafallsreview.ca/2016/04/'
foreach($html->find('a[href^="http://www.niagarafallsreview.ca/2016/04/"]') as $link) {
$links[] = $link->href;
}
// trim the headlines because one on top and bottom were not needed
$output = array_slice($headlines, 1, -1);
// make sure the links are unique, then re-index them:
// array_unique() keeps the original keys, which would not line up with $output
$result = array_values(array_unique($links));
// for each link output it in a nice list
foreach ($result as $i=>$linkk) {
$headline = isset($output[$i]) ? $output[$i] : '(empty)';
echo "<a href='$linkk'>$headline</a>" . "<br />";
}

Here is the foreach you are looking for:
foreach($output as $i=>$headers) {
$linkk = $result[$i];
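echo "<a href='$linkk'>$headers</a>" . "<br />";
}
This assumes both arrays have the same length and are in the same order. Note that array_unique() preserves the original keys, so re-index $result with array_values() first; otherwise $result[$i] can be undefined.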
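To make the key problem concrete, here is a minimal sketch that pairs the two lists by position. The headlines and URLs are made-up sample data standing in for the scraped values, and the loop stops at the shorter array so a mismatch in length does not raise notices.

```php
<?php
// Hypothetical sample data standing in for the scraped values.
$headlines = array('Site title', 'Council approves budget', 'Bridge reopens', 'Footer heading');
$links = array(
    'http://example.com/2016/04/budget',
    'http://example.com/2016/04/budget',   // duplicate, e.g. image link plus text link
    'http://example.com/2016/04/bridge',
);

// Drop the first and last heading, as in the original code.
$output = array_slice($headlines, 1, -1);

// array_unique() keeps the original keys (here 0 and 2), so re-index with array_values().
$result = array_values(array_unique($links));

// Pair the arrays by position; stop at the shorter one to avoid undefined offsets.
$items = array();
$count = min(count($output), count($result));
for ($i = 0; $i < $count; $i++) {
    $items[] = "<a href='" . htmlspecialchars($result[$i], ENT_QUOTES) . "'>"
             . htmlspecialchars($output[$i]) . "</a>";
}

echo implode("<br />\n", $items), "\n";
```

Pairing by position only works if the page lists headings and links in the same order; scraping each headline together with its own link would be more robust, but that depends on the page markup.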

Related

Unable to print links in another function

I've written some PHP code to scrape certain links from the main page of Wikipedia. When I run my script, the links come through as expected.
However, I've now defined two functions in my script in order to learn how to pass links from one function to another. My goal is to print the links in the second function, but it only prints the first link and nothing else.
If I use only the function fetch_wiki_links(), I can get several links, but when I try to print them from get_links_in_ano_func(), it prints the first link only.
How can I get them all even when I use the second function?
This is what I've written so far:
include("simple_html_dom.php");
$prefix = "https://en.wikipedia.org";
function fetch_wiki_links($prefix)
{
$weblink = "https://en.wikipedia.org/wiki/Main_Page";
$htmldoc = file_get_html($weblink);
foreach ($htmldoc->find("a[href^='/wiki/']") as $a) {
$links = $a->href . '<br>';
$absolute_links = $prefix . $links;
return $absolute_links;
}
}
function get_links_in_ano_func($absolute_links)
{
echo $absolute_links;
}
$items = fetch_wiki_links($prefix);
get_links_in_ano_func($items);
Your function returns during the very first iteration of the loop, so only one link ever comes back. Collect the links in an array and return after the loop has finished, something like this:
function fetch_wiki_links($prefix)
{
$weblink = "https://en.wikipedia.org/wiki/Main_Page";
$htmldoc = file_get_html($weblink);
$absolute_links = array();
foreach ($htmldoc->find("a[href^='/wiki/']") as $a) {
$links = $a->href . '<br>';
$absolute_links []= $prefix . $links;
}
return implode("\n", $absolute_links);
}
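If you would rather keep the links as an array and let the second function do the printing, a sketch along these lines might help. The function names and the hard-coded href list are hypothetical; the list stands in for what find() returns so the example runs without the network.

```php
<?php
// Hypothetical helper: $hrefs stands in for the href values found on the page.
function build_absolute_links($prefix, array $hrefs)
{
    $absolute_links = array();
    foreach ($hrefs as $href) {
        // Append to the array; a return here would end the function on the first item.
        $absolute_links[] = $prefix . $href;
    }
    return $absolute_links; // return once, after the loop has finished
}

// The second function receives the whole array and prints every entry.
function print_links(array $links)
{
    foreach ($links as $link) {
        echo $link, "<br>\n";
    }
}

$items = build_absolute_links('https://en.wikipedia.org', array('/wiki/PHP', '/wiki/HTML'));
print_links($items);
```

Returning an array instead of an imploded string keeps the formatting decision in the printing function.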

annoying array tags.. want a pretty output

What I'm trying to do is make my output usable in a spreadsheet.
I want each item in the output to appear without the array tags and without being mashed together, starting with an asterisk and ending with a % sign.
<?php
$file = file_get_contents('aaa.txt'); //get file to string
$row_array = explode("\n",$file); //cut string to rows by new line
$row_array = array_count_values(array_filter($row_array));
foreach ($row_array as $key=>$counts) {
if ($counts==1)
$no_duplicates[] = $key;
}
//do what You want
echo '<pre>';
print_r($no_duplicates);
//write to file. If file don't exist. Create it
file_put_contents('no_duplicates.txt',$no_duplicates);
?>
Maybe this would give you what you want:
$str = "*" . implode("% *", $no_duplicates) . "%";
echo '<pre>';
echo $str;
echo '</pre>';
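If the goal is one item per line for a spreadsheet import, a variation could look like the sketch below. The sample rows are made up (in the question they come from aaa.txt), and the one-entry-per-line layout is an assumption about what the spreadsheet expects.

```php
<?php
// Hypothetical file contents; in the question these come from aaa.txt.
$rows = array('apple', 'pear', 'apple', '', 'plum');

// Count occurrences of each non-empty row and keep those seen exactly once.
$counts = array_count_values(array_filter($rows, 'strlen'));
$no_duplicates = array_keys(array_filter($counts, function ($n) {
    return $n === 1;
}));

// One "*item%" entry per line, so each item lands in its own row.
$lines = array_map(function ($item) {
    return '*' . $item . '%';
}, $no_duplicates);
$text = implode("\n", $lines) . "\n";

echo $text;
// file_put_contents('no_duplicates.txt', $text); // uncomment to write the file
```

Writing the formatted string rather than the raw array also avoids file_put_contents() gluing the elements together with no separator.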

PHP foreach loop read files, create array and print file name

Could someone help me with this?
I have a folder with some files (without extensions):
/module/mail/templates
It contains these files:
test
test2
I want to loop over the folder, read the file names (test and test2) and print them to my HTML form as dropdown items. This part works (the rest of the form's HTML tags are above and below the code shown here, and are omitted).
But I also want to read each file's content, assign it to a variable $content, and place it in an array I can use later.
This is how I tried to achieve this, without luck:
foreach (glob("module/mail/templates/*") as $templateName)
{
$i++;
$content = file_get_contents($templateName, r); // This is not working
echo "<p>" . $content . "</p>"; // this is not working
$tpl = str_replace('module/mail/templates/', '', $templatName);
$tplarray = array($tpl => $content); // not working
echo "<option id=\"".$i."\">". $tpl . "</option>";
print_r($tplarray);//not working
}
This code worked for me:
<?php
$tplarray = array();
$i = 0;
echo '<select>';
foreach(glob('module/mail/templates/*') as $templateName) {
$content = file_get_contents($templateName);
if ($content !== false) {
$tpl = str_replace('module/mail/templates/', '', $templateName);
$tplarray[$tpl] = $content;
echo "<option id=\"$i\">$tpl</option>" . PHP_EOL;
} else {
trigger_error("Cannot read $templateName");
}
$i++;
}
echo '</select>';
print_r($tplarray);
?>
Initialize the array outside of the loop. Then assign it values inside the loop. Don't try to print the array until you are outside of the loop.
The r in the call to file_get_contents is wrong. Take it out. The second argument to file_get_contents is optional and should be a boolean if it is used.
Check that file_get_contents() doesn't return FALSE which is what it returns if there is an error trying to read the file.
You have a typo where you are referring to $templatName rather than $templateName.
$tplarray = array();
$i = 0; // initialise the counter before incrementing it in the loop
foreach (glob("module/mail/templates/*") as $templateName) {
$i++;
$content = file_get_contents($templateName);
if ($content !== FALSE) {
echo "<p>" . $content . "</p>";
} else {
trigger_error("file_get_contents() failed for file $templateName");
}
$tpl = str_replace('module/mail/templates/', '', $templateName);
$tplarray[$tpl] = $content;
echo "<option id=\"".$i."\">". $tpl . "</option>";
}
print_r($tplarray);
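A self-contained variation, as a rough sketch: it creates a throwaway directory with two hypothetical template files so it can run anywhere, uses basename() instead of str_replace() to get the file name, and builds the dropdown from the collected array after the loop.

```php
<?php
// Hypothetical setup: build a throwaway directory with two extensionless files.
$dir = sys_get_temp_dir() . '/tpl_demo_' . uniqid();
mkdir($dir);
file_put_contents($dir . '/test', 'Hello from test');
file_put_contents($dir . '/test2', 'Hello from test2');

$tplarray = array();
foreach (glob($dir . '/*') as $templatePath) {
    if (!is_file($templatePath)) {
        continue; // skip sub-directories
    }
    $content = file_get_contents($templatePath);
    if ($content === false) {
        continue; // unreadable file; skip it
    }
    // basename() strips the directory part whatever the path prefix is.
    $tplarray[basename($templatePath)] = $content;
}

// Build the dropdown after the loop, from the collected array.
echo "<select>\n";
foreach (array_keys($tplarray) as $i => $name) {
    echo '<option id="' . $i . '">' . htmlspecialchars($name) . "</option>\n";
}
echo "</select>\n";

// Clean up the demo files.
array_map('unlink', glob($dir . '/*'));
rmdir($dir);
```

Keeping the reading and the HTML output in separate loops means $tplarray is complete before anything is printed.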

PHP Simple DOM Parser to Scrape From Multiple URLs

Is it possible to use a foreach loop to scrape multiple URLs from an array? I've been trying, but for some reason it only pulls from the first URL in the array and then shows the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code since the comment part didn't work. :) It's very dirty because I just edited one of the examples to play with it to see if I could get it to do what I wanted.
include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Or you could use the following instead of the whole foreach loop above:
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// The index 0 returns the first matching element. I just had a look
// at Amazon's source code, and it contains 2 HTML tags with id='ASIN'.
// If they were following the HTML rules, there would be only ONE
// element with a given id.
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
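This should do the trick. In your version the function is declared inside the foreach loop, so the second iteration tries to declare scraping_IMDB() again and PHP stops with a "Cannot redeclare" fatal error; that is why only the first URL was scraped. Declare the function once, outside the loop, and only call it inside.
I have also renamed the array to 'links' instead of 'link'. It's an array of links containing individual link(s), so foreach($link as $links) read the wrong way round, and I changed it to foreach($links as $link).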
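The same pattern in a form that runs offline, as a rough sketch: the function is declared once and called for every entry. To avoid network access it swaps simple_html_dom for PHP's built-in DOMDocument and DOMXPath, and the page markup, URLs and the scrape_page() name are all hypothetical.

```php
<?php
// Hypothetical stand-ins for downloaded pages, keyed by URL, so the sketch runs offline.
$pages = array(
    'http://example.com/a' => '<html><body><h1 class="title">First</h1><input id="ASIN" value="A1"></body></html>',
    'http://example.com/b' => '<html><body><h1 class="title">Second</h1><input id="ASIN" value="B2"></body></html>',
);

// Declared once, outside any loop; declaring it a second time would be a fatal error.
function scrape_page($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from imperfect markup
    $xpath = new DOMXPath($doc);

    $ret = array();
    $ret['ASIN'] = $xpath->evaluate('string(//input[@id="ASIN"]/@value)');
    $ret['Name'] = $xpath->evaluate('string(//h1[@class="title"])');
    return $ret;
}

// Call the function once per page and keep every result.
$results = array();
foreach ($pages as $url => $html) {
    $results[$url] = scrape_page($html);
}

foreach ($results as $url => $ret) {
    foreach ($ret as $k => $v) {
        echo "<strong>$k</strong> $v<br />\n";
    }
}
```

Collecting the results in $results, keyed by URL, also makes it easy to use the scraped values later instead of only echoing them.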
I really need to ask this question, as the answer will help a lot more people once they read this thread. What if you used $articles, as in the examples on the simple html dom site?
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
What if it's $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
What would this area look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
I've seen this multiple-links question all over Stack Overflow for the past 2 years, and I still cannot figure it out. It would be great to get a basic handle on it, in the style of the simple html dom examples.
Thanks.
This is my first time posting; I'm sure I broke a bunch of rules and didn't format the code section right. I just badly needed to ask this question.

Need some help with XML parsing

The XML feed is located at: http://xml.betclick.com/odds_fr.xml
I need a PHP loop to echo the name of the match, the hour, the bet options and the odds links.
The function should select and display ONLY the matches of the day with streaming="1" and the bet type "Ftb_Mr3".
I'm new to XPath and SimpleXML.
Thanks in advance.
So far I have:
<?php
$xml_str = file_get_contents("http://xml.betclick.com/odds_fr.xml");
$xml = simplexml_load_string($xml_str);
// need xpath magic
$xml->xpath();
// display
?>
XPath is pretty simple once you get the hang of it.
You basically want to get every match tag with a certain attribute:
//match[@streaming=1]
This works perfectly: it selects every match element anywhere in the document whose streaming attribute equals 1.
I just realised you also want matches with a bet type of "Ftb_Mr3":
//match[@streaming=1]/bets/bet[@code="Ftb_Mr3"]
This returns the bet node, though. We want the match, which we know is the grandparent:
//match[@streaming=1]/bets/bet[@code="Ftb_Mr3"]/../..
The two dots work like they do in file paths and step back up to the match.
To work this into your sample, just change the final bit to
// need xpath magic
$nodes = $xml->xpath('//match[@streaming=1]/bets/bet[@code="Ftb_Mr3"]/../..');
foreach($nodes as $node) {
echo $node['name'].'<br/>';
}
to print all the match names.
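Here is a small offline sketch of the same idea. The XML is a made-up, simplified stand-in for the feed (the real feed's structure may differ), and instead of stepping back up with ../.. it uses a nested predicate so the match itself stays the result node. Filtering for today's date is left out.

```php
<?php
// Hypothetical, simplified feed; element and attribute names follow the question.
$xml_str = <<<XML
<sports>
  <sport name="Football">
    <event name="Ligue 1">
      <match name="A - B" streaming="1" start_date="2016-04-01T20:00:00">
        <bets>
          <bet code="Ftb_Mr3" name="Match Result">
            <choice name="%1%" odd="1.50"/>
            <choice name="Draw" odd="3.20"/>
          </bet>
        </bets>
      </match>
      <match name="C - D" streaming="0" start_date="2016-04-01T21:00:00">
        <bets><bet code="Ftb_Mr3" name="Match Result"/></bets>
      </match>
    </event>
  </sport>
</sports>
XML;

$xml = simplexml_load_string($xml_str);

// The nested predicate filters on the bet but keeps the match as the selected node.
$matches = $xml->xpath('//match[@streaming="1"][bets/bet[@code="Ftb_Mr3"]]');

$names = array();
foreach ($matches as $match) {
    $names[] = (string) $match['name'];
    echo $match['name'], ' at ', $match['start_date'], "\n";
    // Relative XPath from the match down to the choices of the wanted bet.
    foreach ($match->xpath('bets/bet[@code="Ftb_Mr3"]/choice') as $choice) {
        echo '  ', $choice['name'], ': ', $choice['odd'], "\n";
    }
}
```

Attributes are read with array syntax on the SimpleXMLElement, and casting to string gives a plain PHP value to store.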
I don't know how to work xpath really, but if you want to 'loop it', this should get you started:
<?php
$xml = simplexml_load_file("odds_fr.xml");
foreach ($xml->children() as $child)
{
foreach ($child->children() as $child2)
{
foreach ($child2->children() as $child3)
{
foreach($child3->attributes() as $a => $b)
{
echo $a,'="',$b,"\"<br/>";
}
}
}
}
?>
That gets you to the 'match' tag, which has the 'streaming' attribute. I don't really know what 'matches of the day' are, either, but...
It's basically right out of the W3Schools reference:
http://www.w3schools.com/PHP/php_ref_simplexml.asp
I am using this on a project, scraping Betclic odds with:
<?php
$match_csv = fopen('matches.csv', 'w');
$bet_csv = fopen('bets.csv', 'w');
$xml = simplexml_load_file('http://xml.cdn.betclic.com/odds_en.xml');
$bookmaker = 'Betclick';
foreach ($xml as $sport) {
$sport_name = $sport->attributes()->name;
foreach ($sport as $event) {
$event_name = $event->attributes()->name;
foreach ($event as $match) {
$match_name = $match->attributes()->name;
$match_id = $match->attributes()->id;
$match_start_date_str = str_replace('T', ' ', $match->attributes()->start_date);
$match_start_date = strtotime($match_start_date_str);
if (!empty($match->attributes()->live_id)) {
$match_is_live = 1;
} else {
$match_is_live = 0;
}
if ($match->attributes()->streaming == 1) {
$match_is_running = 1;
} else {
$match_is_running = 0;
}
// pass the fields as an array so a name containing a comma stays in one column
$match_row = array($match_id, $bookmaker, $sport_name, $event_name, $match_name, $match_start_date, $match_is_live, $match_is_running);
fputcsv($match_csv, array_map('strval', $match_row));
foreach ($match as $bets) {
foreach ($bets as $bet) {
$bet_name = $bet->attributes()->name;
foreach ($bet as $choice) {
// team numbers are surrounded by %, we strip them
$choice_name = str_replace('%', '', $choice->attributes()->name);
// get the float value of the odds
$odd = (float)$choice->attributes()->odd;
// build the row as an array so a name containing a comma stays in one column
$bet_row = array($match_id, $bet_name, $choice_name, $odd);
fputcsv($bet_csv, array_map('strval', $bet_row));
}
}
}
}
}
}
fclose($match_csv);
fclose($bet_csv);
?>
I then load the CSV files into MySQL. The script runs once a minute and works great so far.
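One detail worth knowing when writing rows like these: fputcsv() takes the fields as an array and does the quoting itself, so a value that contains a comma survives as a single column. A minimal sketch with a hypothetical row, written to an in-memory stream so it leaves no files behind:

```php
<?php
// Write to an in-memory stream so the sketch leaves no files behind.
$out = fopen('php://memory', 'w+');

// Hypothetical match row; note the comma inside the match name.
$row = array('123', 'Betclick', 'Football', 'Ligue 1', 'Paris, FC - Lyon', 1459540800, 0, 1);

// fputcsv() quotes any field that contains the delimiter.
fputcsv($out, $row, ',', '"', '\\');

// Read the line back to see what was written.
rewind($out);
$line = stream_get_contents($out);
fclose($out);

echo $line;

// Parsing the line again yields the original eight fields.
$parsed = str_getcsv(trim($line), ',', '"', '\\');
```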
