I am using an HTML parser that searches for HTML elements and prints them on screen as they come up.
Some are matched by ID, some are H4 tags.
The issue is that after it finds the ID it then looks for the H4 elements.
When I run a foreach loop at the end, only the H4 elements come up, but not the one price.
I would like to know why this is happening.
I am new to PHP and loving it, but I don't get why the key is resetting and forgetting the ID key.
Code:
<?php
ini_set('memory_limit','128M');
set_time_limit(0);
include_once('simple_html_dom.php');
$target_url= "ethicon2.html";
$html = new simple_html_dom();
$html -> load_file($target_url);
$line = 0;
$ref = $html-> find('.price');
$ref = $html-> find('h4');
$ref = $html-> find('h4');
foreach ($ref as $value) {
print "$value<br>";
}
?>
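Each assignment to $ref replaces the previous value, so only the result of the last find() call is left when the loop runs. Try adding them into an array like so: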
$ref[] = $html-> find('.price');
$ref[] = $html-> find('h4');
$ref[] = $html-> find('h4');
EDIT
If you want these to appear in one array, try this:
$ref2 = array();
foreach($ref as $r)
{
$ref2 = array_merge($ref2,$r);
}
print_r($ref2);
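To see the difference between overwriting and appending without needing the parser, here is a minimal sketch. The find_stub() function and its sample data are hypothetical stand-ins for $html->find(); they are not part of Simple HTML DOM.

```php
<?php
// Hypothetical stand-in for $html->find(): returns the matches for a selector.
function find_stub(string $selector): array
{
    $matches = [
        '.price' => ['$19.99'],
        'h4'     => ['First heading', 'Second heading'],
    ];
    return $matches[$selector] ?? [];
}

// Plain assignment: each call replaces the previous result.
$ref = find_stub('.price');
$ref = find_stub('h4');
$overwritten = $ref;

// Appending: keep every result set, then flatten them into one list.
$sets = [];
$sets[] = find_stub('.price');
$sets[] = find_stub('h4');
$all = array_merge(...$sets);

print_r($overwritten); // only the h4 entries survive
print_r($all);         // the price followed by the h4 entries
```

The spread form of array_merge() does the same job as the foreach loop in the EDIT above, as long as at least one result set has been collected.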
Hey, I've been trying to scrape data from an HTML table and I'm not having much luck.
Website: https://www.dnr.state.mn.us/hunting/seasons.html
What I'm trying to do: I want to grab the contents of the table and encode it as JSON, like
['event_title' 'Waterfowl'] and ['event_date' '09/25/21']
but I don't know how to do this. I've tried a couple of different things, but in the end I can't get it to work.
Code example (closest I got):
<?php
$dom = new DOMDocument;
$page = file_get_contents('https://www.dnr.state.mn.us/hunting/seasons.html');
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//tbody/tr') as $tr) {
$tmp = []; // reset the temporary array so previous entries are removed
foreach ($xpath->query("td[@class]", $tr) as $td) {
$key = preg_match('~[a-z]+$~', $td->getAttribute('class'), $out) ? $out[0] : 'no_class';
if ($key === "event-title") {
$tmp['event_title'] = $xpath->query("a", $td);
}
$tmp[$key] = trim($td->textContent);
}
//$tmp['event_date'] = date("M. dS 'y", strtotime(preg_replace('~\.|\d+[ap]m *~', '', $tmp['date'])));
//$result[] = $tmp;
$marray[] = array_unique($tmp);
print_r($marray);
}
//$array2 = var_export($result);
//print_r($array2[1]);
//var_export($result);
//echo "\n----\n";
//echo json_encode($result);
?>
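For what it's worth, here is a rough sketch of one way to get from table rows to JSON with DOMXPath. It runs against an inline sample instead of the live page, and the class names in the sample (views-field-event-title and so on) are assumptions for illustration; the real page's markup may well differ.

```php
<?php
// Hypothetical sample markup; the real page's class names may differ.
$page = <<<HTML
<table><tbody>
<tr><td class="views-field-event-title"><a href="#">Waterfowl</a></td><td class="views-field-event-date">09/25/21</td></tr>
<tr><td class="views-field-event-title"><a href="#">Deer</a></td><td class="views-field-event-date">11/06/21</td></tr>
</tbody></table>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//tbody/tr') as $tr) {
    $tmp = []; // one entry per table row
    foreach ($xpath->query('td[@class]', $tr) as $td) {
        // Use the trailing "event-xxx" part of the class name as the key.
        if (preg_match('~event-[a-z]+$~', $td->getAttribute('class'), $out)) {
            $key = str_replace('-', '_', $out[0]);
            $tmp[$key] = trim($td->textContent);
        }
    }
    if ($tmp) {
        $result[] = $tmp;
    }
}

$json = json_encode($result);
echo $json, "\n";
```

Each row is stored as plain strings (textContent), since json_encode() cannot serialize a DOMNodeList in any useful way.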
I'm testing some data scraping. The value I'm scraping is a percentage, so I basically slapped on an
echo "%<br>";
after the actual number output, which is
echo $ret_[66];
However, the percent sign also appears on a line of its own before the number, which is not what I want. This is the output:
%
-0.02%
whereas what I'm trying to get is just -0.02%.
Clearly I'm doing something wrong with the PHP. I'd really appreciate any feedback or solutions. Thank you!
Full code:
<?php
error_reporting(E_ALL^E_NOTICE^E_WARNING);
include_once "global.php";
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.moneycontrol.com/markets/global-indices/');
$xpath = new DOMXPath($doc);
$query = "//div[@class='MT10']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$result = trim($entry->textContent);
$ret_ = explode(' ', $result);
// make sure no element in the array starts or ends with whitespace
foreach ($ret_ as $key => $val){
$ret_[$key] = trim($val);
}
// remove the empty elements and the elements that are only "\n", "\r" or "\t"
// I modified this line
$ret_ = array_values(array_filter($ret_, 'deleteBlankInArray'));
//echo the last element
echo $ret_[66];
echo "%<br>";
}
<?php
echo "%<br>";
?>
Putting it in a separate PHP block right after the first one does the same thing.
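One likely explanation, going only by the output shown: the XPath query matches more than one div, and for one of them there is nothing at index 66, so that pass through the loop prints a bare "%" (the notice is hidden by the error_reporting line). A minimal sketch of guarding against that follows; the token lists and the index 2 are made-up stand-ins for the real page data and the hard-coded 66.

```php
<?php
// Hypothetical token lists, one per matched <div>; the real page content is unknown.
$entries = [
    ['Global', 'Indices'],           // a block with nothing at the wanted index
    ['Nikkei', '28000', '-0.02'],
];
$index = 2; // stands in for the hard-coded 66 in the question

$lines = [];
foreach ($entries as $tokens) {
    // Skip blocks that have nothing at the wanted position.
    if (!isset($tokens[$index]) || $tokens[$index] === '') {
        continue;
    }
    $lines[] = $tokens[$index] . '%';
}

echo implode("<br>\n", $lines), "<br>\n";
```

Printing the number and the percent sign together, and only when the number exists, avoids the orphaned "%" line.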
I am working with DOM and HTML. I want to convert a node value to a string:
$html = @$dom->loadHTMLFile('url');
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('body');
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
foreach ($rows as $text =>$row)
{
$t=1;
// get each column by tag name
$cols = $row->getElementsByTagName('td');
//getting values
$rr = @$cols->item(0)->nodeValue;
print $rr; // prints the values of all the 'td' tags fine
}
print $rr; // prints nothing; this is where I want it to print
?>
I want the node values converted into strings for further manipulation.
Every time you loop through the foreach you overwrite the value of the $rr variable. The second print $rr will print the value of the last td - if it's empty, then it will print nothing.
If what you are trying to do is print all the values, instead write them to an array:
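$rr = array();
foreach ($rows as $row) {
// fetch this row's cells before reading from them
$cols = $row->getElementsByTagName('td');
if ($cols->length > 0) {
$rr[] = $cols->item(0)->nodeValue;
}
}
print_r($rr);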
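Here is the same idea as a small self-contained sketch that can be run without a remote page. The inline markup is a made-up sample, and joining the values with implode() is just one way to end up with a single string.

```php
<?php
// Hypothetical sample markup standing in for the remote page.
$source = '<html><body><table>'
    . '<tr><td>alpha</td><td>1</td></tr>'
    . '<tr><td>beta</td><td>2</td></tr>'
    . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($source);

$values = [];
foreach ($dom->getElementsByTagName('tr') as $row) {
    $cols = $row->getElementsByTagName('td');
    if ($cols->length > 0) {
        // nodeValue is already a string; trim() just tidies whitespace
        $values[] = trim($cols->item(0)->nodeValue);
    }
}

// After the loop, every value is still available.
$joined = implode(', ', $values);
echo $joined, "\n";
```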
// new dom object
$dom = new DOMDocument();
//load the html
$html = @$dom->loadHTMLFile('http://webapp-da1-01.corp.adobe.com:8300/cfusion/bootstrap/');
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$tables = $dom->getElementsByTagName('head');
//get all rows from the table
$la=array();
$rows = $tables->item(0)->getElementsByTagName('tr');
// loop over the table rows
$array = array();
foreach ($rows as $text =>$row)
{
$t=1;
$tt=$text;
// get each column by tag name
$cols = $row->getElementsByTagName('td');
// echo the values
#echo @$cols->item(0)->nodeValue.'';
// echo @$cols->item(1)->nodeValue.'';
$array[$row] = @$cols->item($t)->nodeValue;
}
print_r ($array);
It prints Array
(
)
Nothing more. I also tried "$cols->item(0)->nodeValue;".
Use DOMDocument::saveXML() or DOMDocument::saveHTML() to convert a node to a string.
Did you try @$cols->item(0)->textContent?
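A quick sketch of the difference between the two suggestions, using a made-up one-cell table: textContent gives only the text, while saveHTML() with a node argument gives the serialized markup of that node. Both are ordinary PHP strings.

```php
<?php
$dom = new DOMDocument();
// Hypothetical sample markup.
$dom->loadHTML('<html><body><table><tr><td><b>42</b> items</td></tr></table></body></html>');

$cell = $dom->getElementsByTagName('td')->item(0);

$text   = $cell->textContent;    // text only, tags stripped
$markup = $dom->saveHTML($cell); // the element serialized with its tags

var_dump(is_string($text), is_string($markup));
echo $text, "\n", $markup, "\n";
```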
I am trying to parse the table shown here into a multi-dimensional PHP array. I am using the following code, but for some reason it's returning an empty array. After searching around on the web, I found this site, which is where I got the parseTable() function from. Going by the comments on that website, the function works perfectly, so I'm assuming there is something wrong with the way I'm getting the HTML code from file_get_contents(). Any thoughts on what I'm doing wrong?
<?php
$data = file_get_contents('http://flow935.com/playlist/flowhis.HTM');
function parseTable($html)
{
// Find the table
preg_match("/<table.*?>.*?<\/[\s]*table>/s", $html, $table_html);
// Get title for each row
preg_match_all("/<th.*?>(.*?)<\/[\s]*th>/", $table_html[0], $matches);
$row_headers = $matches[1];
// Iterate each row
preg_match_all("/<tr.*?>(.*?)<\/[\s]*tr>/s", $table_html[0], $matches);
$table = array();
foreach($matches[1] as $row_html)
{
preg_match_all("/<td.*?>(.*?)<\/[\s]*td>/", $row_html, $td_matches);
$row = array();
for($i=0; $i<count($td_matches[1]); $i++)
{
$td = strip_tags(html_entity_decode($td_matches[1][$i]));
$row[$row_headers[$i]] = $td;
}
if(count($row) > 0)
$table[] = $row;
}
return $table;
}
$output = parseTable($data);
print_r($output);
?>
I want my output array to look something like this:
1
--> 11:33AM
--> DEV
--> IN THE DARK
2
--> 11:29AM
--> LIL' WAYNE
--> SHE WILL
3
--> 11:26AM
--> KARDINAL OFFISHALL
--> NUMBA 1 (TIDE IS HIGH)
Don't cripple yourself parsing HTML with regexps! Instead, let an HTML parser library worry about the structure of the markup for you.
I suggest you check out Simple HTML DOM (http://simplehtmldom.sourceforge.net/). It is a library written specifically to help with this kind of web scraping problem in PHP. With such a library you can write your scraper in far fewer lines of code, without having to come up with working regexps.
In principle, with Simple HTML DOM you just write something like:
$html = file_get_html('http://flow935.com/playlist/flowhis.HTM');
foreach($html->find('tr') as $row) {
// Parse table row here
}
This can be then extended to capture your data in some format, for instance to create an array of artists and corresponding titles as:
<?php
require('simple_html_dom.php');
$table = array();
$html = file_get_html('http://flow935.com/playlist/flowhis.HTM');
foreach($html->find('tr') as $row) {
$time = $row->find('td',0)->plaintext;
$artist = $row->find('td',1)->plaintext;
$title = $row->find('td',2)->plaintext;
$table[$artist][$title] = true;
}
echo '<pre>';
print_r($table);
echo '</pre>';
?>
We can see that this code can be (trivially) changed to reformat the data in any other way as well.
I tried simple_html_dom, but on larger files and on repeated calls to the function I am getting zend_mm_heap_corrupted on PHP 5.3 (GAH). I have also tried preg_match_all, but this has been failing on a larger file (5000 lines of HTML, which was only about 400 rows of my HTML table).
I am using this instead, and it's working fast and not spitting out errors.
$dom = new DOMDocument();
//load the html
$html = $dom->loadHTMLFile("htmltable.html");
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$tables = $dom->getElementsByTagName('table');
//get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
// get each column by tag name
$cols = $rows->item(0)->getElementsByTagName('th');
$row_headers = NULL;
foreach ($cols as $node) {
//print $node->nodeValue."\n";
$row_headers[] = $node->nodeValue;
}
$table = array();
//get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
// get each column by tag name
$cols = $row->getElementsByTagName('td');
$row = array();
$i=0;
foreach ($cols as $node) {
# code...
//print $node->nodeValue."\n";
if($row_headers==NULL)
$row[] = $node->nodeValue;
else
$row[$row_headers[$i]] = $node->nodeValue;
$i++;
}
$table[] = $row;
}
var_dump($table);
This code worked well for me.
Example of original code is here.
http://techgossipz.blogspot.co.nz/2010/02/how-to-parse-html-using-dom-with-php.html
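If you want to try the header-as-key idea without a file on disk, here is a compact sketch on an inline table. The sample rows are borrowed from the playlist output quoted earlier in this thread; the column names Time, Artist and Title are my own assumption about the headers.

```php
<?php
// Hypothetical sample table standing in for htmltable.html.
$source = '<table>'
    . '<tr><th>Time</th><th>Artist</th><th>Title</th></tr>'
    . '<tr><td>11:33AM</td><td>DEV</td><td>IN THE DARK</td></tr>'
    . '<tr><td>11:29AM</td><td>LIL\' WAYNE</td><td>SHE WILL</td></tr>'
    . '</table>';

$dom = new DOMDocument();
$dom->loadHTML($source);

$headers = [];
$table = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->childNodes as $node) {
        if ($node->nodeName === 'th') {
            $headers[] = trim($node->nodeValue);
        } elseif ($node->nodeName === 'td') {
            $cells[] = trim($node->nodeValue);
        }
    }
    // Only keep data rows whose cell count matches the header count.
    if ($cells && count($cells) === count($headers)) {
        $table[] = array_combine($headers, $cells);
    }
}

print_r($table);
```

Unlike the version above, this skips the header row instead of adding an empty entry for it, and array_combine() replaces the manual $i counter.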
Is it possible to use a foreach loop to scrape multiple URLs from an array? I've been trying, but for some reason it will only pull from the first URL in the array and then show the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code since the comment part didn't work. :) It's very dirty because I just edited one of the examples to play with it to see if I could get it to do what I wanted.
include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Or you could use the following instead of the whole foreach loop above:
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// assuming the 0 means "return the first match" or something similar.
// I just had a look at Amazon's source code, and it contains
// 2 HTML tags with id='ASIN'. If they were following the HTML rules,
// there would be only ONE element with a given id.
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
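This should do the trick. The function is now declared once, outside the loop. Declaring a named function inside the foreach makes PHP try to declare it again on the second iteration, which is a fatal error, so only the first URL was ever processed.
I have also renamed the array to 'links' instead of 'link'. It's an array of links, containing link(s), so foreach($link as $links) seemed wrong, and I changed it to foreach($links as $link).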
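The pattern on its own, stripped of the scraping details, looks roughly like this. scrape_stub() is a hypothetical placeholder that just pulls the last path segment out of the URL; a real version would call file_get_html($link) as in the code above.

```php
<?php
// Hypothetical stand-in for the scraper; a real one would call file_get_html($link).
function scrape_stub(string $link): array
{
    // Take the last path segment as a fake ASIN.
    $asin = basename(rtrim(parse_url($link, PHP_URL_PATH), '/'));
    return ['ASIN' => $asin];
}

$links = [
    'http://www.amazon.com/dp/B0038JDEOO/',
    'http://www.amazon.com/dp/B0038JDEM6/',
    'http://www.amazon.com/dp/B004CYX17O/',
];

// The function is declared once above; the loop only calls it.
$results = [];
foreach ($links as $link) {
    $results[$link] = scrape_stub($link);
}

foreach ($results as $link => $ret) {
    foreach ($ret as $k => $v) {
        echo '<strong>' . $k . '</strong> ' . $v . '<br />' . "\n";
    }
}
```

Keeping the results in an array keyed by URL is optional, but it makes it easy to check that every link was actually visited.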
I really need to ask this question, as the answer will help a lot of other people who read this thread. What if you used $articles, like the examples on the Simple HTML DOM site?
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
What if it's $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
What would this part look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
I've seen this multiple-links question all over Stack Overflow for the past 2 years, and I still cannot figure it out. It would be great to get a basic handle on it, in the style of the Simple HTML DOM examples.
Thanks.
This is my first time posting; I'm sure I broke a bunch of rules and didn't do the code section right. I just badly had to ask this question.