I am new to DOM parsing, but I've got most of this figured out. I'm just having trouble removing &nbsp; from a div.
Here's my PHP:
function parseDOM($url) {
$dom = new DOMDocument;
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$movies = array();
foreach ($xpath->query('//div[@class="mshow"]') as $movie) {
$item = array();
$links = $xpath->query('.//a', $movie);
$item['trailer'] = $links->item(0)->getAttribute('href');
$item['reviews'] = $links->item(1)->getAttribute('href');
$item['link'] = $links->item(2)->getAttribute('href');
$item['title'] = $links->item(2)->nodeValue;
$item['rating'] = trim($xpath->query('.//strong/following-sibling::text()',
$movie)->item(0)->nodeValue);
$i = 0;
foreach ($xpath->query('.//div[@class="rsd"]', $movie) as $date) {
$dates = $xpath->query('.//div[@class="rsd"]', $movie);
$times = $xpath->query('.//div[@class="rst"]', $movie);
$item['datetime'][] = $dates->item($i)->nodeValue . $times->item($i)->nodeValue;
$i += 1;
}
$movies[] = $item;
}
return $movies;
}
$url = 'http://www.tribute.ca/showtimes/theatres/may-cinema-6/mayc5/?datefilter=-1';
$movies = parseDOM($url);
foreach ($movies as $key => $value) {
echo $value['title'] . '<br>';
echo $value['link'] . '<br>';
echo $value['rating'] . '<br>';
foreach ($value['datetime'] as $datetime) {
echo $datetime . '<br>';
}
}
Here's what the HTML looks like:
<div class="rst" >6:45pm &nbsp; 9:30pm &nbsp;</div>
Is there something I can add to the xpath query to achieve this? I did try adding strip_tags to $times->item($i)->nodeValue, but it's still printing out like: Thu, May 01: 6:45pm   9:30pm  Â
Edit: str_replace("\xc2\xa0", '', $times->item($i)->nodeValue); seems to do the trick.
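As far as I can tell, the edit above works because DOMDocument decodes the &nbsp; entity into the character U+00A0, and nodeValue returns it as the two UTF-8 bytes 0xC2 0xA0. Shown on a page read as Latin-1, those bytes appear as "Â" followed by a non-breaking space. A minimal sketch on a hypothetical inline sample (the markup string and the variable names are assumptions, not the real page):

```php
<?php
// Hypothetical sample standing in for one showtimes <div> on the real page.
$html = '<div class="rst">6:45pm &nbsp; 9:30pm &nbsp;</div>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides warnings about imperfect markup
$xpath = new DOMXPath($dom);

$raw = $xpath->query('//div[@class="rst"]')->item(0)->nodeValue;

// The entity is already decoded, so match the UTF-8 bytes of U+00A0.
$noNbsp = str_replace("\xc2\xa0", '', $raw);

// Collapse the leftover runs of ordinary whitespace and trim the ends.
$times = trim(preg_replace('/\s+/', ' ', $noNbsp));

echo $times, "\n";
```

Searching for the literal string "&nbsp;" would find nothing here, because the entity no longer exists in the decoded text.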
Try this:
$times->item($i)->nodeValue = str_replace("&nbsp;","",$times->item($i)->nodeValue);
It should delete every &nbsp;.
EDIT
your line :
$item['datetime'][] = $dates->item($i)->nodeValue . $times->item($i)->nodeValue;
become :
$item['datetime'][] = $dates->item($i)->nodeValue
. str_replace("&nbsp;","",$times->item($i)->nodeValue);
EDIT 2
If str_replace does not work, try str_ireplace, as suggested in the comments.
If it still doesn't work, you can also try:
preg_replace("#&nbsp;#","",$times->item($i)->nodeValue);
EDIT 3
You may have an encoding problem; see utf8_encode.
Or piggy solution :
str_replace("Â","",$times->item($i)->nodeValue);
Apolo
Hey I've been trying to scrape data from an html table and I'm not having much luck.
Website: https://www.dnr.state.mn.us/hunting/seasons.html
What I'm trying to do: I want to grab the contents of the table and encode it into json like
['event_title' 'Waterfowl'] and ['event_date' '09/25/21']
but I don't know how to do this, I've tried a couple different things but in the end I can't get it to work.
Code Example (Closest I got):
<?php
$dom = new DOMDocument;
$page = file_get_contents('https://www.dnr.state.mn.us/hunting/seasons.html');
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//tbody/tr') as $tr) {
$tmp = []; // reset the temporary array so previous entries are removed
foreach ($xpath->query("td[@class]", $tr) as $td) {
$key = preg_match('~[a-z]+$~', $td->getAttribute('class'), $out) ? $out[0] : 'no_class';
if ($key === "event-title") {
$tmp['event_title'] = $xpath->query("a", $td);
}
$tmp[$key] = trim($td->textContent);
}
//$tmp['event_date'] = date("M. dS 'y", strtotime(preg_replace('~\.|\d+[ap]m *~', '', $tmp['date'])));
//$result[] = $tmp;
$marray[] = array_unique($tmp);
print_r($marray);
}
//$array2 = var_export($result);
//print_r($array2[1]);
//var_export($result);
//echo "\n----\n";
//echo json_encode($result);
?>
I want to retrieve an HTML element in a page.
<h2 id="resultCount" class="resultCount">
<span>
Showing 1 - 12 of 40,923 Results
</span>
</h2>
I need to get the total number of results so I can test it in my PHP code.
For now, I take everything between the h2 tags and explode it on spaces.
Then I explode the count on the comma and concatenate the pieces, to strip the thousands separator from the number. Once that's done, I test the result count.
define("MAX_RESULT_ALL_PAGES", 1200);
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$htmlResultCountPage = file_get_html($queryUrl);
$htmlResultCount = $htmlResultCountPage->find("h2[id=resultCount]");
$resultCountArray = explode(" ", $htmlResultCount[0]);
$explodeCount = explode(',', $resultCountArray[5]);
$europeFormatCount = '';
foreach ($explodeCount as $val) {
$europeFormatCount .= $val;
}
if ($europeFormatCount > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
}
At the moment the total number of results is not retrieved correctly, and the condition is not met even when it should be.
Does anyone have a solution to this problem, or another way to do it?
I would simply fetch the page as a string (not html) and use a regular expression to get the total number of results. The code would look something like this:
define('MAX_RESULT_ALL_PAGES', 1200);
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
$queryResult = file_get_contents($queryUrl);
if (preg_match('/of\s+([0-9,]+)\s+Results/', $queryResult, $matches)) {
$totalResults = (int) str_replace(',', '', $matches[1]);
} else {
throw new \RuntimeException('Total number of results not found');
}
if ($totalResults > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
// ...
}
A regex would do it:
...
preg_match("/of ([0-9,]+) Results/", $htmlResultCount[0], $matches);
$europeFormatCount = intval(str_replace(",", "", $matches[1]));
...
Please try this code.
define("MAX_RESULT_ALL_PAGES", 1200);
// new dom object
$dom = new DOMDocument();
// HTML string
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$html_string = file_get_contents($queryUrl);
//discard whitespace-only text nodes (set before loading to take effect)
$dom->preserveWhiteSpace = false;
//load the html
$html = $dom->loadHTML($html_string);
//Get all h2 tags
$nodes = $dom->getElementsByTagName('h2');
// Store total result count
$totalCount = 0;
// loop over the all h2 tags and print result
foreach ($nodes as $node) {
if ($node->hasAttributes()) {
foreach ($node->attributes as $attribute) {
if ($attribute->name === 'class' && $attribute->value == 'resultCount') {
$inner_html = str_replace(',', '', trim($node->nodeValue));
$inner_html_array = explode(' ', $inner_html);
// Add the result count (the sixth space-separated token) to the total
$totalCount += (int) $inner_html_array[5];
}
}
}
}
// If the result count is greater than 1200, do this
if ($totalCount > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
}
Give this a try:
$match =array();
preg_match('/(?<=of\s)(?:\d{1,3}+(?:,\d{3})*)(?=\sResults)/', $htmlResultCount[0], $match);
$europeFormatCount = str_replace(',','',$match[0]);
The regex reads the number between "of " and " Results"; it matches numbers with a ',' thousands separator.
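A quick way to check the lookaround pattern is to run it against the text from the h2 in the question. This is only a sketch: the sample string is copied from the question, and the variable names are my own.

```php
<?php
// Sample text copied from the <h2> shown in the question.
$text = "Showing 1 - 12 of 40,923 Results";

$match = [];
$found = preg_match('/(?<=of\s)(?:\d{1,3}+(?:,\d{3})*)(?=\sResults)/', $text, $match);

// Strip the thousands separators and cast, so later comparisons are numeric.
$count = $found ? (int) str_replace(',', '', $match[0]) : null;

var_dump($count);
```

Casting to int matters for the comparison against MAX_RESULT_ALL_PAGES; a null result means the text was not found and should be handled separately.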
Hi there, I am trying to combine two foreach loops, but I have a problem.
The problem is that the <a href='$link'> is the same for all results, but each one should be different.
Here is the code that i am using:
<?php
$feed = file_get_contents('http://grabo.bg/rss/?city=&affid=16090');
$rss = simplexml_load_string($feed);
$doc = new DOMDocument();
@$doc->loadHTML($feed);
$tags = $doc->getElementsByTagName('link');
foreach ($tags as $tag) {
foreach($rss as $r){
$title = $r->title;
$content = $r->content;
$link = $tag->getAttribute('href');
echo "<a href='$link'>$title</a> <br> $content";
}
}
?>
Where is my mistake? Why isn't it working, and how can I make it work properly?
Thanks in advance!
The two loops iterate over different resources, so you are simply cross-joining all the records in them.
This should work to get the data you need:
<?php
$feed = file_get_contents('http://grabo.bg/rss/?city=&affid=16090');
$rss = simplexml_load_string($feed);
foreach ($rss as $key => $entry) {
if ($key == "entry")
{
$title = (string) $entry->title;
$content = (string) $entry->content;
$link = (string) $entry->link["href"];
echo "<a href='$link'>$title</a><br />" . $content;
}
}
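To illustrate the pairing idea, here is a sketch that reads the title and the link from the same entry in a single loop, so each title keeps its own link. The inline feed and the example.com URLs are assumptions for illustration; the real feed's structure may differ.

```php
<?php
// Hypothetical two-entry feed; the real feed's structure may differ.
$feed = <<<XML
<feed>
  <entry><title>Deal A</title><content>First</content><link href="http://example.com/a"/></entry>
  <entry><title>Deal B</title><content>Second</content><link href="http://example.com/b"/></entry>
</feed>
XML;

$rss = simplexml_load_string($feed);

$rows = [];
foreach ($rss->entry as $entry) {
    // Title and link come from the same <entry>, so they stay paired.
    $rows[] = [
        'title' => (string) $entry->title,
        'link'  => (string) $entry->link['href'],
    ];
}

foreach ($rows as $row) {
    echo "<a href='{$row['link']}'>{$row['title']}</a><br />\n";
}
```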
The following snippet of PHP code creates $desc alright, but I'd like it to insert two (2) blank spaces between every dpItemFeaturesList item it finds as it goes through its iteration.
I can't seem to figure out what snippet to add, or where, to do this.
function get_description($asin){
$url = 'http://www.amazon.com/gp/aw/d/' . $asin . '?d=f&pd=1';
$data = request_data($url);
$desc = '';
if ($data) {
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
if (preg_match('#dpItemFeaturesList#',$data)){
$k = $xpath->query('//ul[@class="dpItemFeaturesList"]');
foreach ($k as $c => $tot) {
$desc .= $tot->nodeValue;
}
}
}
return $desc;
}
Looking at the code you have shared here and consequently having a look at the data that you are processing (a sample of which I have pasted here) you actually want to collect the text within the <li> child elements of the <ul class="dpItemFeaturesList"> node.
In your original code snippet your XPath is as follows:
'//ul[@class="dpItemFeaturesList"]'
This will only select the <ul> element and not its child elements. Consequently, when you read $tot->nodeValue, it concatenates all the text within all its child nodes without spaces (ah ha, the real reason why you want spaces in the first place).
To fix this we should do two things:
1. Select the <li> nodes within the appropriate node. Change the XPath to //ul[@class="dpItemFeaturesList"]/li.
2. In the foreach loop, concatenate 2 non-breaking spaces (because this is HTML) to the $desc variable.
Here $c is the array index.
function get_description($asin){
$url = 'http://www.amazon.com/gp/aw/d/' . $asin . '?d=f&pd=1';
$data = request_data($url);
$desc = '';
if ($data) {
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
if (preg_match('#dpItemFeaturesList#',$data)){
$k = $xpath->query('//ul[@class="dpItemFeaturesList"]/li');
foreach ($k as $c => $tot) {
if ($c > 0) {
$desc .= "&nbsp;&nbsp;";
}
$desc .= $tot->nodeValue;
}
}
}
return $desc;
}
We check for $c > 0 so that you will not get extra spaces after the last node in the loop.
P.S.: Unrelated to your original question. The code for which you shared a link has an undefined variable $timestamp in $date = date("format", $timestamp); on line 116.
Since you're appending everything to desc, try something like
$desc .= $tot->nodeValue;
$desc .= "<br />";
Try this:
$desc .= $tot->nodeValue.'  ';
and call trim($desc) after the loop to avoid two spaces at the end.
Or, alternatively, create an array:
$desc = array();
//....
$desc[] = $tot->nodeValue;
and return implode('  ', $desc)
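The array approach can be sketched end to end as follows. The sample markup is a hypothetical stand-in for the real product page, and selecting the li elements (rather than the ul) is an assumption about what should be joined.

```php
<?php
// Hypothetical sample markup standing in for the real product page.
$data = '<ul class="dpItemFeaturesList">'
      . '<li>Waterproof</li><li>Lightweight</li><li>2-year warranty</li>'
      . '</ul>';

$dom = new DOMDocument();
@$dom->loadHTML($data); // @ hides warnings about imperfect markup
$xpath = new DOMXPath($dom);

// Collect each feature separately instead of appending to one string.
$features = [];
foreach ($xpath->query('//ul[@class="dpItemFeaturesList"]/li') as $li) {
    $features[] = trim($li->nodeValue);
}

// implode() puts the separator only between items, so no trim is needed.
$desc = implode('  ', $features);

echo $desc, "\n";
```

Because the separator only ever lands between two items, there is no special case for the first or last one.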
If you need the spaces between items, add them in front on every iteration but the first:
$k = $xpath->query('//ul[@class="dpItemFeaturesList"]');
foreach ($k as $c => $tot) {
$c && $desc .= '  '; # all but first
$desc .= $tot->nodeValue;
}
This expression saves you an if, but it works the same way. It is partly a matter of taste, so an if can do it as well:
$k = $xpath->query('//ul[@class="dpItemFeaturesList"]');
foreach ($k as $c => $tot) {
if($c) $desc .= '  '; # all but first
$desc .= $tot->nodeValue;
}
This works because every integer except zero evaluates to true in PHP.
See the demo.
I have inherited some PHP code (but I have little PHP experience) and can't find out how to count some elements in the object returned by simplexml_load_file().
The code is something like this:
$xml = simplexml_load_file($feed);
for ($x=0; $x<6; $x++) {
$title = $xml->channel[0]->item[$x]->title[0];
echo "<li>" . $title . "</li>\n";
}
It assumes there will be at least 6 <item> elements but sometimes there are fewer so I get warning messages in the output on my development system (though not on live).
How do I extract a count of <item> elements in $xml->channel[0]?
Here are several options, from my most to least favourite (of the ones provided).
One option is to make use of the SimpleXMLIterator in conjunction with LimitIterator.
$xml = simplexml_load_file($feed, 'SimpleXMLIterator');
$items = new LimitIterator($xml->channel->item, 0, 6);
foreach ($items as $item) {
echo "<li>{$item->title}</li>\n";
}
If that looks too scary, or not scary enough, then another is to throw XPath into the mix.
$xml = simplexml_load_file($feed);
$items = $xml->xpath('/rss/channel/item[position() <= 6]');
foreach ($items as $item) {
echo "<li>{$item->title}</li>\n";
}
Finally, with little change to your existing code, there is also:
$xml = simplexml_load_file($feed);
for ($x=0; $x<6; $x++) {
// Break out of loop if no more items
if (!isset($xml->channel[0]->item[$x])) {
break;
}
$title = $xml->channel[0]->item[$x]->title[0];
echo "<li>" . $title . "</li>\n";
}
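To see how the LimitIterator option behaves when a feed has fewer than six items, here is a sketch on an inline feed. The feed contents are hypothetical, and simplexml_load_string stands in for simplexml_load_file so the example is self-contained.

```php
<?php
// Hypothetical feed with only two <item> elements.
$feed = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample</title>
    <item><title>First</title></item>
    <item><title>Second</title></item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($feed, 'SimpleXMLIterator');

// Ask for up to six items; the loop simply ends when the items run out.
$titles = [];
foreach (new LimitIterator($xml->channel->item, 0, 6) as $item) {
    $titles[] = (string) $item->title;
}

print_r($titles);
```

The limit is an upper bound only, so no explicit count or isset check is needed.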
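The easiest way is to call SimpleXMLElement::count() on the list of <item> elements (calling it on $xml->channel[0] itself would count every child of the channel, not just the items):
$xml = simplexml_load_file($feed);
$num = $xml->channel[0]->item->count();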
for ($x=0; $x<$num; $x++) {
$title = $xml->channel[0]->item[$x]->title[0];
echo "<li>" . $title . "</li>\n";
}
Also note that $xml->channel[0]->item is a SimpleXMLElement object. This class implements the Traversable interface, so we can use it directly in a foreach loop:
$xml = simplexml_load_file($feed);
foreach ($xml->channel[0]->item as $item) {
$title = $item->title[0];
echo "<li>" . $title . "</li>\n";
}
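A small sketch of what count() returns depending on where it is called. The inline feed is hypothetical: its channel has five child elements, three of which are items.

```php
<?php
// Hypothetical feed: <channel> has five children, three of them <item>.
$feed = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample</title>
    <link>http://example.com/</link>
    <item><title>One</title></item>
    <item><title>Two</title></item>
    <item><title>Three</title></item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($feed);

$allChildren = $xml->channel[0]->count();        // every child of <channel>
$itemCount   = $xml->channel[0]->item->count();  // only the <item> elements

// Iterating ->item visits the items only, not <title> or <link>.
$titles = [];
foreach ($xml->channel[0]->item as $item) {
    $titles[] = (string) $item->title;
}

printf("%d children, %d items\n", $allChildren, $itemCount);
```

For a loop bound, the item count is the one to use; the count on the channel includes its title and link as well.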
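You can get the count with count($xml), which counts the child elements of $xml; for the feed in the question that would be count($xml->channel[0]->item).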
I always do it like this:
$xml = simplexml_load_file($feed);
foreach($xml as $key => $one_row) {
echo $one_row->some_xml_chield;
}