I'm trying to implement functionality to edit an XML-based news feed from a PHP-powered web app. However, the changes never seem to be saved.
The XML file I'm working with looks like this:
<?xml version="1.0" standalone="yes"?>
<issues>
<issue>
<issue_id>1</issue_id>
<issue_name>Don't double my rates!</issue_name>
<issue_body>Congress is on the verge of letting student rates double a week from today. Swing by the UC Lawn at 5:00 this Thursday to reach out to our Representatives and tell them: #DontDoubleMyRates!</issue_body></issue>
<issue>
<issue_id>2</issue_id>
<issue_name>Proposed Senate Budget</issue_name>
<issue_body>College Democrats are baffled by the proposed senate budget. This is our state, we must make our opinions heard! #NCGOPBudget #StopCuts</issue_body></issue>
<issue>
<issue_id>3</issue_id>
<issue_name>Voter Suppression Law Invalidated!</issue_name>
<issue_body>Join us in applauding the US Supreme Court for invalidating Arizona's voter-suppression law requiring that voters present proof of citizenship before voting!</issue_body></issue>
<issue>
<issue_id>4</issue_id>
<issue_name>Here's an actual article I found interesting</issue_name>
<issue_body>Actually, not really beacause I really didn't want to google for some arbitrary article to help test this out so here's a bunch of filler text to hopefully emulate at least the by-line of an article pertaining to the democratic party organization here on campus.</issue_body>
</issue>
</issues>
Here is the relevant PHP script that tries to edit the pre-existing node:
<?php
$newName = $_POST['name'];
$newBody = $_POST['body'];
$issue_id = $_POST['edit'];
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load('issues.xml');
$xpath = new DOMXPath($dom);
$query = '/issues/issue';
foreach($xpath->query($query) as $issue) {
$id = $issue->parentNode->getElementsByTagName("issue_id");
if($id->item($issue_id)->nodeValue = $issue_id) {
$name = $issue->parentNode->getElementsByTagName("issue_name");
$body = $issue->parentNode->getElementsByTagName("issue_body");
$name->item($issue_id-1)->nodeValue = '$newName';
$body->item($issue_id-1)->nodeValue = '$newBody';
break;
}
}
$dom->save("issues.xml");
?>
Here is the referring page which iterates through the child nodes until it finds the previously selected node's ID and then displays its info in a table.
<?php
$issue_id = $_POST['edit'];
$issueArray = array(
'id' =>$_POST['id'],
'issue_name' => $_POST['issue_name'],
'issue_body' => $_POST['issue_body'],
);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load('issues.xml');
$xpath = new DOMXPath($dom);
$query = '/issues/issue';
$i = 0;
echo "<body><form action='saveChanges.php' method='post'><table border='1'><tr><th>ID</th><th>Name</th><th>Body</th></tr>";
foreach($xpath->query($query) as $issue) {
$eventI = $issue->parentNode->getElementsByTagName("issue_id");
if($eventI->item($issue_id)->nodeValue = $issue_id) {
$eventN = $issue->parentNode->getElementsByTagName("issue_name");
$eventP = $issue->parentNode->getElementsByTagName("issue_body");
print "<tr><td>'".$eventI->item($issue_id-1)->nodeValue."'></td><td>'".$eventN->item($issue_id-1)->nodeValue."'></td><td>'".$eventP->item($issue_id-1)->nodeValue."'</td></tr>";
print "<tr><td></td><th>New Name</td><th>New Body</td></tr>";
print "<tr><td></td><td><input type='text' name='name'size='50'</input></td><td><input type='text' name='body' size='200'</input></td></tr>";
print "<tr><td><input type='hidden' name='id' value='$issue_id'/></td><th><input type='submit' action='saveChanges.php' name='edit' method='post' value='Confirm Edit'/></th><th></th>";
break;
}
}
print "</table></body>";
?>
I'm not that great at PHP, and even worse at parsing XML. Any help to get this going in the right direction would be great!
There are all sorts of problems in the code that is manipulating the DOM. Just looking at the contents of the foreach loop, you start with this:
$id = $issue->parentNode->getElementsByTagName("issue_id");
In the line above, you have taken the $issue that you enumerated in the foreach loop and then referenced its parent, which is the same for every issue, thus making the enumeration irrelevant.
You're then getting all issue_id elements in that tree, with which you do this:
if($id->item($issue_id)->nodeValue = $issue_id) {
Here you are using the $issue_id as an index, which assumes that an issue_id of 3 (for example) would always be the third issue, which probably isn't true.
Also a single = is an assignment, not a comparison, which I'm sure was not your intention.
The $name and $body lookups are the same:
$name = $issue->parentNode->getElementsByTagName("issue_name");
$body = $issue->parentNode->getElementsByTagName("issue_body");
Again you're ignoring the $issue that has been enumerated and are working from the parent node, and then just getting all the child elements that match issue_name and issue_body.
And again you're using the $issue_id as an index:
$name->item($issue_id-1)->nodeValue = '$newName';
$body->item($issue_id-1)->nodeValue = '$newBody';
This time, though, you're using $issue_id-1. Was there a reason for that?
Also, when you use single quotes for a string in PHP, the variable is not expanded, so the name will always be set to the literal string $newName rather than the value of that variable. You should either use double quotes or, better still, just assign the value directly.
This is more like what I would expect the code to look like:
foreach($xpath->query($query) as $issue) {
    $id = $issue->getElementsByTagName("issue_id")->item(0);
    if($id->nodeValue == $issue_id) {
        $name = $issue->getElementsByTagName("issue_name")->item(0);
        $body = $issue->getElementsByTagName("issue_body")->item(0);
        $name->nodeValue = $newName;
        $body->nodeValue = $newBody;
        break;
    }
}
The rest of your code has more of the same problems, but hopefully that will point you in the right direction.
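As a minimal sketch of the same idea, the lookup can also be pushed into the XPath expression itself, so no loop over every issue is needed. The function name updateIssue and the in-memory sample document are assumptions made for illustration; they are not part of the original code. The text is written through createTextNode so that characters such as & are escaped by the DOM rather than by hand.

```php
<?php
// Hypothetical helper: update the <issue> whose <issue_id> equals $issueId.
function updateIssue(DOMDocument $dom, int $issueId, string $newName, string $newBody): bool
{
    $xpath = new DOMXPath($dom);
    // $issueId is typed as int, so interpolating it cannot inject XPath syntax.
    $issue = $xpath->query("/issues/issue[issue_id = $issueId]")->item(0);
    if ($issue === null) {
        return false; // no issue with that ID
    }
    foreach (['issue_name' => $newName, 'issue_body' => $newBody] as $tag => $text) {
        $node = $issue->getElementsByTagName($tag)->item(0);
        if ($node === null) {
            return false; // unexpected structure
        }
        // Drop the old content, then append the new text as a text node.
        while ($node->firstChild !== null) {
            $node->removeChild($node->firstChild);
        }
        $node->appendChild($dom->createTextNode($text));
    }
    return true;
}

// Small in-memory stand-in for issues.xml.
$dom = new DOMDocument;
$dom->loadXML(
    '<issues>'
    . '<issue><issue_id>1</issue_id><issue_name>First</issue_name><issue_body>One</issue_body></issue>'
    . '<issue><issue_id>2</issue_id><issue_name>Second</issue_name><issue_body>Two</issue_body></issue>'
    . '</issues>'
);

if (updateIssue($dom, 2, 'Renamed & updated', 'New body text')) {
    echo $dom->saveXML(); // with a real file this would be $dom->save('issues.xml')
}
```

Judging by the form in the question, the issue ID travels in the hidden field named id, while the field named edit is the submit button and carries 'Confirm Edit'. If that reading is right, the save script would want something like (int) $_POST['id'] as the ID rather than $_POST['edit'].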
I know how to use XPath to scrape and echo text from another website via identifiers such as a div id or class, using the code below. But I don't know how to do it under more precise conditions, for example when the text I want has no unique identifier such as a div.
The code below prints the scraped data.
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.nbcnews.com/business');
$xpath = new DOMXPath($doc);
$query = "//div[@class='market']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
In the example source below, I want to pull the value "21,271.97". But there's no unique tag for this, no div id. Is it possible to pull this data by identifying a keyword in the <p> that never changes, for example "DJIA all time"?
<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9,
2017</font>
(<font color="#FF0000"><b bgcolor="#FFFFCC"><font face="Verdana, Arial,
Helvetica, sans-serif" size="2">21,271.97</font></b></font>)</p>
I'm wondering whether I could replace $query = "//div[@class='market']"; with something along the lines of:
$query = "//p['DJIA all time']";
Is this possible?
I also wonder whether something like $query = "//p[='DJIA']"; in a loop could work, though I don't know exactly how to use it.
Thanks!
It would be good to have a play with an online XPath tester - I use https://www.freeformatter.com/xpath-tester.html#ad-output
$query = "//p[contains(text(),'DJIA')]";
Although if you use the page you're after, I've found that the value seems to be the first match for:
$query = "//span[contains(@class,'market_price')]";
The idea is the same in both cases: contains(source, value) matches every node whose source contains value. In the first case, text() is the text of the node; in the second, @class is the class attribute.
Try the XPath expression below:
//p[contains(text(), "DJIA All Time")]//b/font
For the page at the provided link (http://www.nbcnews.com/business), you can get the required text with
//span[text()="DJIA"]/following-sibling::span[@class="market_item market_price"]
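To try the contains() approach without hitting the live site, here is a minimal sketch that runs the expression against the snippet from the question, embedded as a string. It uses contains(., ...) rather than contains(text(), ...): in XPath 1.0, text() passed to contains() is reduced to the first text-node child, whereas . is the whole text of the element. Both work for this snippet. Note that the match is case-sensitive, so "DJIA all time" would not match "DJIA All Time".

```php
<?php
// The paragraph from the question, embedded so the sketch is self-contained.
$html = '<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9, 2017</font> '
      . '(<font color="#FF0000"><b bgcolor="#FFFFCC">'
      . '<font face="Verdana, Arial, Helvetica, sans-serif" size="2">21,271.97</font>'
      . '</b></font>)</p>';

libxml_use_internal_errors(true); // collect markup warnings instead of printing them
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Find the <p> by its fixed keyword, then descend to the <font> inside <b>.
$node = $xpath->query('//p[contains(., "DJIA All Time")]//b/font')->item(0);
$value = $node !== null ? trim($node->textContent) : null;

echo $value, "\n";
```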
I am totally new to PHP development and I would like to extract the contents of a meta tag.
I have this code, which lets me extract the contents of the #squad element.
// Pull in PHP Simple HTML DOM Parser
include("simplehtmldom/simple_html_dom.php");
// Settings on top
$sitesToCheck = array(
// id is the page ID for selector
array("url" => "http://www.arsenal.com/first-team/players", "selector" => "#squad"),
array("url" => "http://www.liverpoolfc.tv/news", "selector" => "ul[style='height:400px;']")
);
$savePath = "cachedPages/";
$emailContent = "";
// For every page to check...
foreach($sitesToCheck as $site) {
$url = $site["url"];
// Calculate the cachedPage name, set oldContent = "";
$fileName = md5($url);
$oldContent = "";
// Get the URL's current page content
$html = file_get_html($url);
// Find content by querying with a selector, just like a selector engine!
foreach($html->find($site["selector"]) as $element) {
$currentContent = $element->plaintext;;
}
// If a cached file exists
if(file_exists($savePath.$fileName)) {
// Retrieve the old content
$oldContent = file_get_contents($savePath.$fileName);
}
// If different, notify!
if($oldContent && $currentContent != $oldContent) {
// Build simple email content
$emailContent = "Hey, the following page has changed!\n\n".$url."\n\n";
}
// Save new content
file_put_contents($savePath.$fileName,$currentContent);
}
// Send the email if there's content!
if($emailContent) {
// Sendmail!
mail("me@myself.name","Sites Have Changed!",$emailContent,"From: alerts@myself.name","\r\n");
// Debug
echo $emailContent;
}
But I want to change this code to get the number of comments.
Here is the meta tag from which I would like to extract just the number of comments:
<meta item="desc" content="Comments:645">
Is that clear enough? If not, please ask.
Thanks for the help.
There are two ways to do this. You could either use the native PHP function get_meta_tags(), like so:
$tags = get_meta_tags('http://yoursite.com');
$comments = $tags['desc'];
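Or you could use a regular expression, but the above would be much more practical. Note that get_meta_tags() keys the array by each tag's name attribute, so a tag written as <meta item="desc" ...> would need name="desc" to show up under $tags['desc'].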
What you are looking for might be screen scraping.
This is the process where a programming language like PHP, Python or Ruby loads a website into memory and uses various selectors to grab content from it.
Screen scraping is mostly used on websites that feature a lot of interesting data but have no JSON or XML APIs.
Having googled around for it, I stumbled on this post:
PHP equivalent of PyQuery or Nokogiri?
This article explains more about screen scraping for the web:
http://en.wikipedia.org/wiki/Web_scraping
Look into using DOMDocument:
$dom = new domDocument;
$dom->loadHTML($htmlPage);
$metas = $dom->documentElement->getElementsByTagName('meta');
$ar = array();
foreach ($metas as $meta) {
$name = $meta->getAttribute('name');
$value = $meta->getAttribute('content');
$ar[$name] = $value;
}
print_r($ar); // print array meta-values
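Because the tag in the question uses item="desc" rather than a name attribute, a lookup on that attribute may be closer to what is needed. The following is a rough sketch under that assumption: the page is replaced by a one-line stand-in string, and the pattern assumes the content always has the form "Comments:" followed by digits.

```php
<?php
// Stand-in for the downloaded page; only the meta tag from the question matters.
$htmlPage = '<html><head><meta item="desc" content="Comments:645"></head><body></body></html>';

libxml_use_internal_errors(true); // keep warnings about real-world markup quiet
$dom = new DOMDocument;
$dom->loadHTML($htmlPage);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// string(...) makes evaluate() return the attribute value as a plain PHP string
// (an empty string if no such tag exists).
$content = $xpath->evaluate('string(//meta[@item="desc"]/@content)');

// Pull the number out of "Comments:645".
$comments = null;
if (preg_match('/Comments:\s*(\d+)/', $content, $m)) {
    $comments = (int) $m[1];
}

var_dump($comments);
```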
I am trying to use DOM to get the days and times and also the rooms from the following batch of HTML (my script already gets everything else; these two are the ones I'm having trouble with):
</td><td class="call">
<span>12549<br/></span>View Book Info
</td><td>
<span id="ctl10_gv_sectionTable_ctl03_lblDays">F:1000AM - 1125AM<br />T:230PM - 355PM</span>
</td><td class="room">
<span id="ctl10_gv_sectionTable_ctl03_lblRoom">KUPF106<br />KUPF106</span>
</td><td class="status"><span id="ctl10_gv_sectionTable_ctl03_lblStatus" class="red">Closed</span></td><td class="max">20</td><td class="now">49</td><td class="instructor">
Schoenebeck Kar
</td><td class="credits">3.00</td>
</tr><tr class="sectionRow">
<td class="section">
101<br />
Here is what I have so far for finding the days:
$tracker =0;
// DAYS AND TIMES
$number = 3;
$digit = "0";
while($tracker<$numSections){
$strNum = strval($number);
$zero = strval($digit);
$start = "ctl10_gv_sectionTable_ctl";
$end = "_lblDays";
$id = $start.$zero.$strNum.$end;
//$days = $html->find('span.$id');
$days=$html->getElementByTagName('span')->getElementById($id);
echo "Days : ";
echo $days[0] . '<br>';
$tracker++;
$number++;
if($number >9){
$digit = "1";
$number=0;
}
}
As you can see from the HTML, the site I'm parsing has fairly unique IDs for some of its spans (ctl10_gv_sectionTable_ctl03_lblRoom). I only posted one section's HTML block, so what you don't see is that the markup for the next class section is identical except for the "ctl03" part. That is what all the extra code above takes care of, in case anyone is thrown off by it.
I've tried a few different ways but cannot seem to get the days (e.g. "1000AM - 1125AM") or the rooms (e.g. KUPF106). The rest is pretty simple to grab, but these two don't have class identifiers or even a td identifier. I think I just need to know how to use the value I have in $id as the specific span ID I am looking for. If so, can someone show me how to do that?
This:
$html->getElementByTagName('span')->getElementById($id);
makes no sense. The DOM method is getElementsByTagName (plural), and it returns a DOMNodeList, which does not have a getElementById method.
I think you mean $html->getElementById($id);, but I can't be sure because I don't know what $html is.
Once you have the element, you can get the text value with $element->textContent if you don't need to walk among the text nodes.
Have you considered using DOMXPath for your parsing task? It's probably much easier and clearer.
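As a sketch of how that could look with DOMXPath, the ID can be built with sprintf (which handles the zero padding of "03", "10" and so on) and the text nodes on either side of the <br /> can be collected separately. The helper name spanLines and the trimmed-down HTML string are assumptions for illustration, and $html is assumed here to be a raw HTML string rather than a parser object.

```php
<?php
// Cut-down version of the markup from the question.
$html = '<table><tr class="sectionRow"><td>'
      . '<span id="ctl10_gv_sectionTable_ctl03_lblDays">F:1000AM - 1125AM<br />T:230PM - 355PM</span>'
      . '</td><td class="room">'
      . '<span id="ctl10_gv_sectionTable_ctl03_lblRoom">KUPF106<br />KUPF106</span>'
      . '</td></tr></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Hypothetical helper: return the text lines of one span, split at <br />.
function spanLines(DOMXPath $xpath, int $section, string $suffix): array
{
    // %02d pads the section number: 3 becomes "03", 10 stays "10".
    $id = sprintf('ctl10_gv_sectionTable_ctl%02d_%s', $section, $suffix);
    $lines = [];
    // text() selects the text-node children, so each side of a <br /> is separate.
    foreach ($xpath->query("//span[@id='$id']/text()") as $textNode) {
        $text = trim($textNode->nodeValue);
        if ($text !== '') {
            $lines[] = $text;
        }
    }
    return $lines;
}

$days  = spanLines($xpath, 3, 'lblDays');
$rooms = spanLines($xpath, 3, 'lblRoom');

print_r($days);
print_r($rooms);
```

Looping $section from 3 upward would then replace the manual $digit and $number bookkeeping in the question.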
Simple HTML DOM should be avoided unless you're using PHP version <= 4. The built-in DOM functions in PHP 5 use the much more reliable libxml2 library.
The proper way to iterate that HTML is to first identify the rows to iterate and then write XPath expressions to pull the data relative to that row.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DomXpath($dom);
foreach($xpath->query("//tr[@class='sectionRow']") as $row){
    echo $xpath->query(".//span[contains(@id,'Days')]",$row)->item(0)->nodeValue."\n";
    echo $xpath->query(".//span[contains(@id,'Room')]",$row)->item(0)->nodeValue."\n";
    echo $xpath->query(".//span[contains(@id,'Status')]",$row)->item(0)->nodeValue."\n";
}
Can you please help me with the correct syntax to use when you want to check the innerHTML/nodeValue of an element?
I have no problem with the name. However, the age is within a plain div element. What is the correct syntax to use in place of "NOT SURE WHAT TO PUT HERE" below?
$html is a page from the internet.
The person's name is in a span like:
<span class="fullname">John Smith</span>
The person's age is in a div like:
<div>Age: 28</div>
I have the following PHP:
<?php
$dom = new DomDocument();
@$dom->loadHTML($html);
$finder = new DOMXPath($dom);
//Full Name
$findName = "fullname";
$queryName = $finder->query("//span[contains(@class, '$findName')]");
$name = $queryName->item(0)->nodeValue;
//Age
$findAge = "Age: ";
$queryAge = $finder->query("//div[NOT SURE WHAT TO PUT HERE]");
$age = substr($queryAge->item(0)->nodeValue, 5);
?>
Try
$queryAge = $finder->query("//div[starts-with(., '$findAge')]");
I've had limited success with starts-with() due to whitespace so you may have to resort to
$queryAge = $finder->query("//div[contains(., '$findAge')]");
If there's a chance of finding false positives (i.e., other divs with "Age: " in them), you might be able to avoid that by using a more specific path (if known), e.g.
$queryAge = $finder->query("//div[@id='something']//div[contains(., '$findAge')]");
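One possible way around the whitespace problem is to wrap the node in normalize-space(), which strips leading and trailing whitespace before starts-with() runs. The sketch below also adds not(div), because contains(., ...) and similar tests look at all nested text and would otherwise match any outer div that wraps the age. The wrapper div with id="profile" and the stray whitespace are assumptions added to make the example realistic.

```php
<?php
// Stand-in page: the age div has leading whitespace and sits inside another div.
$html = '<div id="profile"><span class="fullname">John Smith</span><div>
    Age: 28
</div></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
$finder = new DOMXPath($dom);

// not(div) skips wrapper divs; normalize-space(.) trims the text before the test.
$queryAge = $finder->query("//div[not(div) and starts-with(normalize-space(.), 'Age: ')]");

$age = null;
if ($queryAge->length > 0 && preg_match('/\d+/', $queryAge->item(0)->nodeValue, $m)) {
    $age = (int) $m[0]; // first run of digits after the label
}

var_dump($age);
```

Checking $queryAge->length before calling item(0) avoids an error on pages where no matching div exists.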
I'm currently using Magpie RSS but it sometimes falls over when the RSS or Atom feed isn't well formed. Are there any other options for parsing RSS and Atom feeds with PHP?
I've always used the SimpleXML functions built in to PHP to parse XML documents. It's one of the few generic parsers out there that has an intuitive structure to it, which makes it extremely easy to build a meaningful class for something specific like an RSS feed. Additionally, it will detect XML warnings and errors, and upon finding any you could simply run the source through something like HTML Tidy (as ceejayoz mentioned) to clean it up and attempt it again.
Consider this very rough, simple class using SimpleXML:
class BlogPost
{
    public $date;
    public $ts;
    public $link;
    public $title;
    public $text;
    public $summary; // shortened, tag-free version of $text
}

class BlogFeed
{
    public $posts = array();

    function __construct($file_or_url)
    {
        $file_or_url = $this->resolveFile($file_or_url);
        // Leave $posts empty if the feed cannot be loaded or parsed
        if (!($x = simplexml_load_file($file_or_url))) {
            return;
        }

        foreach ($x->channel->item as $item) {
            $post = new BlogPost();
            $post->date = (string) $item->pubDate;
            $post->ts = strtotime($item->pubDate);
            $post->link = (string) $item->link;
            $post->title = (string) $item->title;
            $post->text = (string) $item->description;
            // Create summary as a shortened body and remove images,
            // extraneous line breaks, etc.
            $post->summary = $this->summarizeText($post->text);
            $this->posts[] = $post;
        }
    }

    // Treat anything that is not an http(s) URL as a file under /shared/xml/
    private function resolveFile($file_or_url)
    {
        if (!preg_match('|^https?:|', $file_or_url)) {
            $feed_uri = $_SERVER['DOCUMENT_ROOT'] .'/shared/xml/'. $file_or_url;
        } else {
            $feed_uri = $file_or_url;
        }
        return $feed_uri;
    }

    private function summarizeText($summary)
    {
        $summary = strip_tags($summary);
        // Truncate summary line to 100 characters
        $max_len = 100;
        if (strlen($summary) > $max_len) {
            $summary = substr($summary, 0, $max_len) . '...';
        }
        return $summary;
    }
}
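The error detection mentioned above can be sketched with libxml_use_internal_errors(), which makes libxml collect parse errors instead of emitting PHP warnings. The function name parseFeed and the two sample feed strings are assumptions for illustration; what to do with a rejected feed (retry after cleaning, log it, skip it) is left open.

```php
<?php
// Hypothetical helper: returns [SimpleXMLElement or null, list of error messages].
function parseFeed(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // collect errors, no warnings
    libxml_clear_errors();

    $feed = simplexml_load_string($xml); // false when the XML is not well formed

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return [$feed === false ? null : $feed, $errors];
}

$good = '<rss version="2.0"><channel><title>Demo</title>'
      . '<item><title>Hello</title></item></channel></rss>';
$bad  = '<rss version="2.0"><channel><title>Demo</channel></rss>'; // mismatched tag

[$feed, $errors] = parseFeed($good);
echo (string) $feed->channel->title, "\n";

[$feed, $errors] = parseFeed($bad);
if ($feed === null) {
    echo "Feed rejected:\n", implode("\n", $errors), "\n";
}
```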
With four lines, I can import an RSS feed into an array:
$feed = implode(file('http://yourdomains.com/feed.rss'));
$xml = simplexml_load_string($feed);
$json = json_encode($xml);
$array = json_decode($json,TRUE);
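Two things are worth knowing about this conversion, shown in the sketch below against a small inline feed (the feed content is made up for illustration). First, CDATA sections are lost unless the LIBXML_NOCDATA flag is passed. Second, a repeated element such as item becomes a list, but a feed with only one item yields a single associative array instead, so it helps to normalise before looping.

```php
<?php
// Made-up feed with one CDATA description and one plain description.
$feed = '<rss version="2.0"><channel><title>Demo feed</title>'
      . '<item><title>First</title><description><![CDATA[<b>Hello</b>]]></description></item>'
      . '<item><title>Second</title><description>Plain text</description></item>'
      . '</channel></rss>';

// LIBXML_NOCDATA merges CDATA into ordinary text so it survives json_encode().
$xml   = simplexml_load_string($feed, 'SimpleXMLElement', LIBXML_NOCDATA);
$array = json_decode(json_encode($xml), true);

// Normalise: with a single <item>, $items would be one associative array.
$items = $array['channel']['item'] ?? [];
if ($items !== [] && !isset($items[0])) {
    $items = [$items];
}

foreach ($items as $item) {
    echo $item['title'], ': ', $item['description'], "\n";
}
```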
For a more complex solution:
$feed = new DOMDocument();
$feed->load('file.rss');
$json = array();
$json['title'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$json['description'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
$json['link'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('link')->item(0)->firstChild->nodeValue;
$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
$json['item'] = array();
$i = 0;
foreach($items as $key => $item) {
$title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$description = $item->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
$pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
$guid = $item->getElementsByTagName('guid')->item(0)->firstChild->nodeValue;
$json['item'][$key]['title'] = $title;
$json['item'][$key]['description'] = $description;
$json['item'][$key]['pubdate'] = $pubDate;
$json['item'][$key]['guid'] = $guid;
}
echo json_encode($json);
Your other options include:
SimplePie
Last RSS
PHP Universal Feed Parser
I would like to introduce a simple script to parse RSS:
$i = 0; // counter
$url = "http://www.banki.ru/xml/news.rss"; // url to parse
$rss = simplexml_load_file($url); // XML parser
// RSS items loop
print '<h2><img style="vertical-align: middle;" src="'.$rss->channel->image->url.'" /> '.$rss->channel->title.'</h2>'; // channel title + img with src
foreach($rss->channel->item as $item) {
if ($i < 10) { // parse only 10 items
print ''.$item->title.'<br />';
}
$i++;
}
If a feed isn't well-formed XML, you're supposed to reject it, no exceptions. You're entitled to call the feed's creator a bozo.
Otherwise you're paving the way to the same mess that HTML ended up in.
The HTML Tidy library is able to fix some malformed XML files. Running your feeds through that before passing them on to the parser may help.
I use SimplePie to parse a Google Reader feed and it works pretty well and has a decent feature set.
Of course, I haven't tested it with non-well-formed RSS / Atom feeds, so I don't know how it copes with those. I'm assuming Google's are fairly standards compliant! :)
Personally I use BNC Advanced Feed Parser. I like the template system, which is very easy to use.
The PHP RSS reader - http://www.scriptol.com/rss/rss-reader.php - is a complete but simple parser used by thousands of users...
Another great free parser - http://bncscripts.com/free-php-rss-parser/
It's very light (only 3kb) and simple to use!