How to scrape data from HTML Table in PHP - php

Hey, I've been trying to scrape data from an HTML table and I'm not having much luck.
Website: https://www.dnr.state.mn.us/hunting/seasons.html
What I'm trying to do: I want to grab the contents of the table and encode it as JSON like
['event_title' 'Waterfowl'] and ['event_date' '09/25/21']
but I don't know how to do this. I've tried a few different things, but I can't get any of them to work.
Code Example (Closest I got):
<?php
$dom = new DOMDocument;
$page = file_get_contents('https://www.dnr.state.mn.us/hunting/seasons.html');
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//tbody/tr') as $tr) {
$tmp = []; // reset the temporary array so previous entries are removed
foreach ($xpath->query("td[@class]", $tr) as $td) {
$key = preg_match('~[a-z]+$~', $td->getAttribute('class'), $out) ? $out[0] : 'no_class';
if ($key === "event-title") {
$tmp['event_title'] = $xpath->query("a", $td);
}
$tmp[$key] = trim($td->textContent);
}
//$tmp['event_date'] = date("M. dS 'y", strtotime(preg_replace('~\.|\d+[ap]m *~', '', $tmp['date'])));
//$result[] = $tmp;
$marray[] = array_unique($tmp);
print_r($marray);
}
//$array2 = var_export($result);
//print_r($array2[1]);
//var_export($result);
//echo "\n----\n";
//echo json_encode($result);
?>
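One possible approach, sketched on an inline HTML snippet rather than the live page. The class names event-title and event-date and the sample rows are assumptions for illustration; the real markup of the DNR page may differ. Each row becomes one associative array keyed by the cell's class, and the collected rows are encoded once, after the loop has finished.

```php
<?php
// Hypothetical sample markup; the real page's class names may differ.
$page = <<<HTML
<table><tbody>
  <tr><td class="event-title"><a href="/waterfowl">Waterfowl</a></td><td class="event-date">09/25/21</td></tr>
  <tr><td class="event-title"><a href="/deer">Deer</a></td><td class="event-date">11/06/21</td></tr>
</tbody></table>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

$result = [];
foreach ($xpath->query('//tbody/tr') as $tr) {
    $row = [];
    foreach ($xpath->query('td[@class]', $tr) as $td) {
        // "event-title" becomes the key "event_title"
        $key = str_replace('-', '_', $td->getAttribute('class'));
        $row[$key] = trim($td->textContent);
    }
    if ($row) {
        $result[] = $row;   // skip rows that had no classed cells
    }
}

$json = json_encode($result);
echo $json, "\n";
```

Building $result inside the loop and calling json_encode() once at the end avoids printing a growing array on every iteration.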

Related

Array only printing last value [duplicate]

This question already has answers here:
How to store values from foreach loop into an array?
(9 answers)
Closed 1 year ago.
Consider the following PHP code, which scrapes a client's old static website for his customers' emails...
$urls = explode(PHP_EOL, file_get_contents('urls.txt'));
print '<pre>'; print_r($urls); print '</pre>';
print '<strong>Results:</strong><br>';
function get_emails($url) {
$html = file_get_contents($url);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link){
$href = $link->getAttribute('href');
if (strpos($href, 'mailto') !== false) {
return str_replace("mailto:","",$href) . '<br>';
}
}
}
foreach ($urls as $key => $url) {
print get_emails($url);
}
I am reading a list of URLs from urls.txt, but the output contains only the result for the last URL in the file; all of the others are ignored. I had hoped it would return a nice list of all his customers' emails so we can import them into the new site.
Can someone help diagnose the issue?
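The cause is this line:
return str_replace("mailto:","",$href) . '<br>';
return exits the function as soon as the first mailto link is found, so the loop never reaches any later links on that page.
1. Either echo each match inside the loop: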
$urls = explode(PHP_EOL, file_get_contents('urls.txt'));
print '<pre>'; print_r($urls); print '</pre>';
print '<strong>Results:</strong><br>';
function get_emails($url) {
$html = file_get_contents($url);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link){
$href = $link->getAttribute('href');
if (strpos($href, 'mailto') !== false) { // keep only mailto links, as in the question
echo str_replace("mailto:","",$href) . '<br>';
}
}
}
foreach ($urls as $key => $url) {
get_emails($url);
}
2. Or collect the matches in an array and return it:
$urls = explode(PHP_EOL, file_get_contents('urls.txt'));
print '<pre>'; print_r($urls); print '</pre>';
print '<strong>Results:</strong><br>';
function get_emails($url) {
$html = file_get_contents($url);
$data = array(); //define array
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link){
$href = $link->getAttribute('href');
if (strpos($href, 'mailto') !== false) { // keep only mailto links, as in the question
$data[] = str_replace("mailto:","",$href) . '<br>'; //assign each value to the array
}
}
return $data;
}
foreach ($urls as $key => $url) {
print_r(get_emails($url));
}
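As a variation on option 2, the filtering can be pushed into an XPath query so that only mailto links are visited. This is a minimal sketch on an inline HTML string; the function name extract_emails and the sample addresses are made up for illustration.

```php
<?php
// Hypothetical helper: returns every distinct mailto address found in an HTML string.
function extract_emails(string $html): array
{
    libxml_use_internal_errors(true);   // silence warnings from sloppy markup
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $emails = [];
    // Select only the href attributes that start with "mailto:"
    foreach ($xpath->query('//a[starts-with(@href, "mailto:")]/@href') as $href) {
        $emails[] = substr($href->value, strlen('mailto:'));
    }
    // Drop duplicates and reindex from 0
    return array_values(array_unique($emails));
}

$sample = '<p><a href="mailto:a@example.com">A</a> <a href="/about">About</a> '
        . '<a href="mailto:b@example.com">B</a> <a href="mailto:a@example.com">A again</a></p>';

print_r(extract_emails($sample));
```

In the original script the loop over $urls could pass file_get_contents($url) to such a helper and merge the returned arrays. If urls.txt happens to use line endings that differ from PHP_EOL, each exploded URL may carry a stray "\r", so applying trim() to every URL before fetching it is a cheap safeguard.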

get DHL tracking statuses

Please help me figure out which PHP API or PHP script I should use to get shipment statuses from DHL when all I have are the DHL tracking codes provided by the logistics company that fulfils shipping for the orders from our e-commerce website. My task is to write a PHP cron job that checks and records the DHL tracking status of each shipment for use in back-end reports.
I would much appreciate any suggestion that points me in the right direction.
I am still looking for the right way to achieve this. So far I see no way other than parsing the DHL tracking web page, since a tracking number on its own seems to be insufficient for the API: the DHL API requires login credentials, secret keys and so on. However, my current parsing code might be useful to someone looking for a similar solution. Just fill in your tracking codes and run the code on your localhost or even on http://phpfiddle.org/:
$tracking_array=Array('000000000000', '1111111111111'); // Tracking Codes
function create_track_url($track)
{
// Join the tracking codes with "%2C+" (a URL-encoded comma followed by a space).
// An empty array yields '' and a single code is returned unchanged.
return implode('%2C+', $track);
}
//load the html
$dom = new DOMDocument();
$html = $dom->loadHTMLFile("https://nolp.dhl.de/nextt-online-public/en/search?piececode=".create_track_url($tracking_array));
//discard white space
$dom->preserveWhiteSpace = false;
//the table by its tag name
$xpath = new DOMXpath($dom);
$expression = './/h2[contains(@class, "panel-title")]';
$track_codes =array();
foreach ($xpath->evaluate($expression) as $div) {
$track_codes[]= preg_replace( '/[^0-9]/', '', $div->nodeValue );
}
$tables = $dom->getElementsByTagName('table');
$table = array();
foreach($track_codes as $key => $val)
{
//get all rows from the table that belongs to this tracking code
$rows = $tables->item($key)->getElementsByTagName('tr');
// read the column headers from the first row of that table
$cols = $rows->item(0)->getElementsByTagName('th');
$row_headers = NULL;
foreach ($cols as $node) {
//print $node->nodeValue."\n";
$row_headers[] = $node->nodeValue;
}
foreach ($rows as $row)
{
// get each column by tag name
$cols = $row->getElementsByTagName('td');
$row = array();
$i=0;
foreach ($cols as $node) {
# code...
//print $node->nodeValue."\n";
if($row_headers==NULL)
$row[] = $node->nodeValue;
else
$row[$row_headers[$i]] = $node->nodeValue;
$i++;
}
$table[$val][] = $row;
}
}
print '<pre>';
print_r($table);
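The header-to-cell mapping in the loop above can also be written with array_combine(). The sketch below runs on an inline table; the column names and status rows are invented sample data, not real DHL output.

```php
<?php
// Invented sample table standing in for one tracking result.
$html = '<table>'
      . '<tr><th>Date</th><th>Status</th></tr>'
      . '<tr><td>2021-01-02</td><td>Picked up</td></tr>'
      . '<tr><td>2021-01-03</td><td>Delivered</td></tr>'
      . '</table>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);

$headers = [];
$rows = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $ths = $tr->getElementsByTagName('th');
    if ($ths->length > 0) {
        // Header row: remember the column names and move on.
        foreach ($ths as $th) {
            $headers[] = trim($th->nodeValue);
        }
        continue;
    }
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->nodeValue);
    }
    // Use the headers as keys only when the counts line up.
    $rows[] = (count($headers) === count($cells) && $cells)
        ? array_combine($headers, $cells)
        : $cells;
}

print_r($rows);
```

Skipping the header row with continue also keeps an empty entry for that row out of the result.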

Getting data from HTML using DOMDocument

I'm trying to get data from HTML using DOM. I can get some data, but can't figure out how to get the rest. Here is an image highlighting the data I want.
http://i.imgur.com/Es51s5s.png
here is the code itself
http://pastebin.com/Re8qEivv
and here my PHP code
$html = file_get_contents('result.html');
$dom = new DOMDocument;
$dom->loadHTML($html);
$tr = $dom->getElementsByTagName('tr');
foreach ($tr as $row){
$td = $row->getElementsByTagName('td');
$td1 = $td->item(1);
$td2 = $td->item(2);
foreach ($td1->childNodes as $node){
$title = $node->textContent;
}
foreach ($td2->childNodes as $node){
$type = $node->textContent;
}
}
Figured it out
$html = file_get_contents('result.html');
$dom = new DOMDocument;
$dom->loadHTML($html);
$tr = $dom->getElementsByTagName('tr');
foreach ($tr as $row){
$td = $row->getElementsByTagName('td');
$td1 = $td->item(1);
$td2 = $td->item(2);
$title = $td1->childNodes->item(0)->textContent;
$firstURL = $td1->getElementsByTagName('a')->item(0)->getAttribute('href');
$type = $td2->childNodes->item(0)->textContent;
$imageURL = $td2->getElementsByTagName('img')->item(0)->getAttribute('src');
}
I have used the following class:
http://sourceforge.net/projects/simplehtmldom/
It is a very simple and easy-to-use class.
You can use
$html->find('#RosterReport > tbody', 0);
to find specific table
$html->find('tr')
$html->find('td')
to find table rows or columns
Note that $html is a variable holding the full HTML DOM content.

Looping over 2 elements at the same time (PHP / XPath)

I'm trying to extract 2 elements using PHP cURL and XPath.
So far I get the elements in separate foreach loops, but I would like to get them at the same time:
@$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->evaluate("//p[@class='row']/a/@href");
//$elements = $xpath->query("//p[@class='row']/a");
foreach ($elements as $element) {
$url = $element->nodeValue;
//$title = $element->nodeValue;
}
When I echo each one outside the foreach I only get 1 element, and when it's echoed inside the foreach I get all of them.
My question is: how can I get both (URL and title) at the same time, and what's the best way to insert them into MySQL using PDO?
Thank you.
There is no need, in this case, to use XPath twice. You could do one query and navigate to the associated other node(s).
For example, find all of the hrefs that you are interested in and get their ownerElement's (the <a>) node value.
$hrefs = $xpath->query("//p[@class='row']/a/@href");
foreach ($hrefs as $href) {
$url = $href->value;
$title = $href->ownerElement->nodeValue;
// Insert into db here
}
Or, find all of the <a>s that you are interested in and get their href attributes.
$anchors = $xpath->query("//p[@class='row']/a[@href]");
foreach ($anchors as $anchor) {
$url = $anchor->getAttribute("href");
$title = $anchor->nodeValue;
// Insert into db here
}
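For the database part of the question, one option is a single prepared statement that is executed once per link. The sketch below is an illustration under several assumptions: it uses an in-memory SQLite database so it can run wherever the pdo_sqlite extension is available, and the table name links, its columns and the sample HTML are made up. For MySQL the DSN passed to PDO would change (a mysql:host=...;dbname=... string plus credentials), while the prepare/execute calls stay the same.

```php
<?php
$html = '<p class="row"><a href="/one">First</a></p>'
      . '<p class="row"><a href="/two">Second</a></p>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// In-memory SQLite stands in for MySQL here (assumption for the demo).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (url TEXT, title TEXT)');

// Prepare once, execute for every anchor.
$stmt = $pdo->prepare('INSERT INTO links (url, title) VALUES (:url, :title)');
foreach ($xpath->query("//p[@class='row']/a[@href]") as $anchor) {
    $stmt->execute([
        ':url'   => $anchor->getAttribute('href'),
        ':title' => trim($anchor->nodeValue),
    ]);
}

$saved = $pdo->query('SELECT url, title FROM links ORDER BY rowid')
             ->fetchAll(PDO::FETCH_ASSOC);
print_r($saved);
```

Bound parameters keep the scraped text out of the SQL string itself, so quotes in a title cannot break the statement.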
You're overwriting $url on each iteration. Maybe use an array?
@$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->evaluate("//p[@class='row']/a/@href");
//$elements = $xpath->query("//p[@class='row']/a");
$urls = array();
foreach ($elements as $element){
array_push($urls, $element->nodeValue);
//$title = $element->nodeValue;
}

Xpath for extracting links

I am building a scraper for an automotive site. First I want to get all manufacturers, and after that all model links for each manufacturer, but with the code below I get only the first model in the list. Why?
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.auto-types.com');
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//li[@class='clearfix_center']/a/@href");
$output = array();
foreach($entries as $e) {
$dom2 = new DOMDocument();
@$dom2->loadHTMLFile('http://www.auto-types.com' . $e->textContent);
$xpath2 = new DOMXPath($dom2);
$data = array();
$data['newLinks'] = trim($xpath2->query("//div[@class='modelImage']/a/@href")->item(0)->textContent);
$output[] = $data;
}
echo '<pre>' . print_r($output, true) . '</pre>';
?>
So I need to get: mercedes/100, mercedes/200, mercedes/300, but with my script I get only the first link, mercedes/100.
please help
You need to iterate through the results instead of just taking the first item:
$items = $xpath2->query("//div[@class='modelImage']/a/@href");
$links = array();
foreach($items as $item) {
$links[] = $item->textContent;
}
$data['newLinks'] = implode(', ', $links);
