I have a short script that utilizes the XML_Query2XML PEAR package. It pulls data from a SQL database and outputs to the browser. The XML that appears in the browser is exactly what I want to be saved to a file, but any attempts to use ob_get_contents or any of the other methods I'm familiar with result in a blank output file. The code is as follows:
<?php
set_include_path('/Library/WebServer/Documents/PEAR/');
include 'XML/Query2XML.php';
include 'MDB2.php';
try {
    // initialize Query2XML object
    $q2x = XML_Query2XML::factory(MDB2::factory('mysql://root:pass@site.com/site'));
    $sql = "SELECT * FROM Products";
    $xml = $q2x->getFlatXML($sql);
    header('Content-Type: text/xml');
    $xml->formatOutput = true;
    echo $xml->saveXML();
} catch (Exception $e) {
    echo $e->getMessage();
}
?>
I'm wondering what the general procedure is for saving files with this plugin and output type (XML). Any help is greatly appreciated.
The $xml variable is a DOMDocument object, which means you can use its methods to write it to a file, e.g. save():
$xml->save('foo.xml');
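Applied to your script, a minimal sketch (the output path is just an illustration) would be:
$xml = $q2x->getFlatXML("SELECT * FROM Products");
$xml->formatOutput = true;
// write the DOMDocument to disk instead of echoing it;
// save() returns the number of bytes written, or false on failure
if ($xml->save('/Library/WebServer/Documents/products.xml') === false) {
    echo 'Could not write the XML file.';
}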
I am trying to scrape a remote website and edit parts of the results before updating a couple of tables in the database and subsequently echo()'ing the final document.
Here's a redacted snippet of the code in question for reference:
<?php
require_once 'backend/connector.php';
require_once 'table_access/simplehtmldom_1_5/simple_html_dom.php';
require_once 'pronunciation1.php';

// retrieve lookup term
if (isset($_POST["lookup_term"])) {
    $term = trim($_POST["lookup_term"]);
} else {
    $term = "hombre";
}

$html = file_get_html("http://www.somesite.com/translate/" . rawurlencode($term));
$coll_temp = $html->find('div[id=translate-en]');
$announce = $coll_temp[0]->find('.announcement');
$quickdef = $coll_temp[0]->find('.quickdef');
$meaning = $announce[0] . $quickdef[0];
$html->clear(); // release scraper variable to prevent memory leak issues
unset($html);   // release scraper variable to prevent memory leak issues

$meaning = '<?xml version="1.0" encoding="ISO-8859-15"?>' . $meaning;

// process the newly-created DOM
$dom = new DOMDocument;
$dom->loadHTML($meaning);

// various DOM-manipulation code snippets

// extract the quick definition section
foreach ($dom->find('div[class=quickdef]') as $qdd) {
    $qdh1 = $qdd->find('.source')[0]->find('h1.source-text');
    $qdterm = $qdh1[0]->plaintext;
    $qdlang = $qdh1[0]->getAttribute('source-lang');
    add2qd($qdterm, $qdd, $qdlang);
    unset($qdterm);
    unset($qdlang);
    unset($qdh1);
}

$finalmeaning = $dom->saveHTML(); // store processed DOM in $finalmeaning
push2db($term, $finalmeaning);    // add processed DOM to database
echo $finalmeaning;               // output processed DOM

// release variables
unset($dom);
unset($html);
unset($finalmeaning);

function add2qd($lookupterm, $finalqd, $lang) {
    $connect = dbconn(PROJHOST, CONTEXTDB, PEPPYUSR, PEPPYPWD);
    $sql = 'INSERT IGNORE INTO tblquickdef (word, quickdef, lang) VALUES (:word, :quickdef, :lang)';
    $query = $connect->prepare($sql);
    $query->bindParam(':word', $lookupterm);
    $query->bindParam(':quickdef', $finalqd);
    $query->bindParam(':lang', $lang);
    $query->execute();
    $connect = null;
}

function push2db($lookupword, $finalmean) {
    $connect = dbconn(PROJHOST, DICTDB, PEPPYUSR, PEPPYPWD);
    $sql = 'INSERT IGNORE INTO tbldict (word, mean) VALUES (:word, :mean)';
    $query = $connect->prepare($sql);
    $query->bindParam(':word', $lookupword);
    $query->bindParam(':mean', $finalmean);
    $query->execute();
    $connect = null;
}
?>
The code works fine except for the foreach loop under the // extract the quick definition section comment. The function being called inside this loop is add2qd(), which accepts three string values as input.
Every time this loop runs, PHP throws a fatal error because it thinks find() is undefined. I know find() is a legitimate method in the PHP Simple HTML DOM Parser library because I have used it multiple times in the same code without any problem (in the // retrieve lookup term section). What am I doing wrong?
But you are not using PHP Simple HTML DOM there - only the standard PHP DOMDocument, which does not have a find() method:
$dom = new DOMDocument;
$dom->loadHTML($meaning);
foreach ($dom->find('div[class=quickdef]') as $qdd) {
http://php.net/manual/en/class.domdocument.php
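If you want to stay with DOMDocument, the closest equivalent to find() is an XPath query. A minimal sketch (I've collapsed the two-step .source lookup into a single query; the class and attribute names come from the question's markup):
$xpath = new DOMXPath($dom);
// equivalent of $dom->find('div[class=quickdef]')
foreach ($xpath->query('//div[@class="quickdef"]') as $qdd) {
    // equivalent of ->find('h1.source-text'), scoped to the current div
    $qdh1 = $xpath->query('.//h1[@class="source-text"]', $qdd)->item(0);
    if ($qdh1 !== null) {
        $qdterm = $qdh1->textContent;                 // Simple HTML DOM's ->plaintext
        $qdlang = $qdh1->getAttribute('source-lang'); // getAttribute() exists on DOMElement too
        // pass serialized HTML where Simple HTML DOM passed the element object
        add2qd($qdterm, $dom->saveHTML($qdd), $qdlang);
    }
}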
Hi everyone, I've been using this code for quite a long time:
<?php
$url = 'http://www.smn.gov.ar/mensajes/index.php?observacion=metar&operacion=consultar&87582=on&87641=on&87750=on&87765=on&87222=on&87761=on&87860=on&87395=on&87344=on&87166=on&87904=on&87571=on&87347=on&87803=on&87576=on&87162=on&87532=on&87497=on&87097=on&87046=on&87548=on&87217=on&87506=on&87692=on&87418=on&87574=on&87715=on&87374=on&87289=on&87852=on&87178=on&87896=on&87823=on&87270=on&87155=on&87453=on&87925=on&87934=on&87480=on&87047=on&87553=on&87311=on&87909=on&87436=on&87509=on&87912=on&87623=on&87444=on&87129=on&87371=on&87645=on&87022=on&87127=on&87828=on&87121=on&87938=on&87791=on&87448=on';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// search for td's containing METAR
$metars = $xpath->query('//td[contains(text(), "METAR SA")]');
if ($metars->length <= 0) {
    echo 'no metars found';
    exit;
}

$data = array();
foreach ($metars as $metar) {
    $data[] = $metar->nodeValue;
}
echo '<pre>';
print_r($data);
Well, this was working fine until the program in charge of reading the output was updated, and now it needs clean output.
At the moment I'm getting this:
http://ar.ivao.aero/weather/metar.php
But the program needs it like this:
SABE 161600Z 02006KT 9999 FEW030 24/18 Q1009 =
SAZA 161600Z 18011KT CAVOK 24/08 Q1010 =
SAZB 161700Z 27012KT CAVOK 21/09 Q1011 =
I thought maybe using something like file_get_contents() could be useful, but again it's going to show information I don't want.
I also tried replacing print_r() with var_dump(), but it's the same.
Any ideas?
Is there any way to get this information into a simple txt file?
Regards,
You need to filter out some data. Try to find out what's common in the info you need to output. For instance, all the required info in your raw print_r data seems to begin with METAR. So:
echo '<pre>';
foreach ($metars as $metar) {
    if (substr($metar->nodeValue, 0, 5) === "METAR") {
        echo str_replace("METAR ", "", $metar->nodeValue) . PHP_EOL;
    }
}
That removes any lines like Aeropuerto FORMOSA from the output.
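As for getting this into a simple txt file: the same filter can feed file_put_contents(). A minimal sketch (metars.txt is just an example path):
$lines = array();
foreach ($metars as $metar) {
    if (substr($metar->nodeValue, 0, 5) === "METAR") {
        $lines[] = str_replace("METAR ", "", $metar->nodeValue);
    }
}
// one METAR per line; file_put_contents() returns false on failure
file_put_contents('metars.txt', implode(PHP_EOL, $lines) . PHP_EOL);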
I have made this:
<html>
<head>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>
$(document).ready(function() {
    $("body").html($("#HomePageTabs_cont_3").html());
});
</script>
</head>
<body>
<?php
echo file_get_contents("http://www.bankasya.com.tr/index.jsp");
?>
</body>
</html>
When I check my page with Firebug, it gives countless "missing file" errors (images, CSS files, JS files, etc.). I want to have just a part of the page, not all of it. This code does what I want, but I am wondering if there is a better way.
EDIT:
The page does what I need. I do not need all the contents, so an iframe is useless to me. I just want the raw data of the div #HomePageTabs_cont_3.
Your best bet is PHP server-side parsing. I have written a small snippet to show you how to do this using DOMDocument (and possibly tidy, if your server has it, to clean up all the malformed XHTML).
Caveat: outputs UTF-8. You can change this in the constructor of DOMDocument.
Caveat 2: WILL barf out if its input is neither utf-8 nor iso-8859-9. The current page's charset is iso-8859-9 and I see no reason why they would change this.
header("content-type: text/html; charset=utf-8");
$data = file_get_contents("http://www.bankasya.com.tr/index.jsp");
// Clean it up
if (class_exists("tidy")) {
$dataTidy = new tidy();
$dataTidy->parseString($data,
array(
"input-encoding" => "iso-8859-9",
"output-encoding" => "iso-8859-9",
"clean" => 1,
"input-xml" => true,
"output-xml" => true,
"wrap" => 0,
"anchor-as-name" => false
)
);
$dataTidy->cleanRepair();
$data = (string)$dataTidy;
}
else {
$do = true;
while ($do) {
$start = stripos($data,'<script');
$stop = stripos($data,'</script>');
if ((is_numeric($start))&&(is_numeric($stop))) {
$s = substr($data,$start,$stop-$start);
$data = substr($data,0,$start).substr($data,($stop+strlen('</script>')));
} else {
$do = false;
}
}
// nbsp breaks it?
$data = str_replace(" "," ",$data);
// Fixes for any element that requires a self-closing tag
if (preg_match_all("/<(link|img)([^>]+)>/is",$data,$mt,PREG_SET_ORDER)) {
foreach ($mt as $v) {
if (substr($v[2],-1) != "/") {
$data = str_replace($v[0],"<".$v[1].$v[2]."/>",$data);
}
}
}
// Barf out the inline JS
$data = preg_replace("/javascript:[^;]+/is","#",$data);
// Barf out the noscripts
$data = preg_replace("#<noscript>(.+?)</noscript>#is","",$data);
// Muppets. Malformed comment = one more regexp when they could just learn to write proper HTML...
$data = preg_replace("#<!--(.*?)--!?>#is","",$data);
}
$DOM = new \DOMDocument("1.0","utf-8");
$DOM->recover = true;
function error_callback_xmlfunction($errno, $errstr) { throw new Exception($errstr); }
$old = set_error_handler("error_callback_xmlfunction");
// Throw out all the XML namespaces (if any)
$data = preg_replace("#xmlns=[\"\']?([^\"\']+)[\"\']?#is","",(string)$data);
try {
$DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="utf-8"?>' : "").$data);
} catch (Exception $e) {
$DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="iso-8859-9"?>' : "").$data);
}
restore_error_handler();
error_reporting(E_ALL);
$DOM->substituteEntities = true;
$xpath = new \DOMXPath($DOM);
echo $DOM->saveXML($xpath->query("//div[#id=\"HomePageTabs_cont_3\"]")->item(0));
In order of appearance:
Fetch the data
If we have tidy, sanitize HTML with it
Create a new DOMDocument and load our document ((string)$dataTidy is a short-hand tidy getter)
Create an XPath request path
Use XPath to request all divs whose id is the one we want, take the first item of the collection (->item(0), which will be a DOMElement) and ask the DOM to output its XML content (including the tag itself)
Hope it is what you're looking for... Though you might want to wrap it in a function.
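If you do wrap it up, a bare-bones alternative that skips the tidy/regexp stage and simply lets libxml recover from the bad markup could look like this (fetchDivById is a made-up name, not a library function):
// minimal sketch: suppress libxml's parse warnings and let the HTML parser recover
function fetchDivById($url, $id)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);        // collect warnings instead of emitting them
    $dom->loadHTML(file_get_contents($url)); // DOM's HTML parser tolerates broken markup
    libxml_clear_errors();

    $node = $dom->getElementById($id);       // works for id attributes after loadHTML()
    return ($node !== null) ? $dom->saveXML($node) : null;
}

echo fetchDivById("http://www.bankasya.com.tr/index.jsp", "HomePageTabs_cont_3");
You lose tidy's cleanup, but for a quick grab of one div it is often enough.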
Edit
Forgot to mention: http://rescrape.it/rs.php for the actual script output!
Edit 2
Correction, that site is not W3C-valid, and therefore, you'll either need to tidy it up or apply a set of regular expressions to the input before processing. I'm going to see if I can formulate a set to barf out the inconsistencies.
Edit 3
Added a fix for all those of us who do not have tidy.
Edit 4
Couldn't resist. If you'd actually like the values rather than the table, use this instead of the echo:
$d = new stdClass();
$rows = $xpath->query("//div[@id=\"HomePageTabs_cont_3\"]//tr");
$rc = $rows->length;
for ($i = 1; $i < $rc - 1; $i++) {
    $cols = $xpath->query($rows->item($i)->getNodePath() . "/td");
    $d->{$cols->item(0)->textContent} = array(
        ((float)$cols->item(1)->textContent),
        ((float)$cols->item(2)->textContent)
    );
}
I don't know about you, but for me, data works better than malformed tables.
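And if whatever consumes this would rather have JSON, a couple of lines after the loop turn it into a response (just a suggestion):
header("content-type: application/json; charset=utf-8");
echo json_encode($d); // emits the rows collected above as a single JSON object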
(Welp, that one took a while to write)
I'd get in touch with the remote site's owner and ask if there was a data feed I could use that would just return the content I wanted.
Sébastien's answer is the best solution, but if you want to use jQuery you can add a base tag in the head section of your site to avoid not-found errors on images:
<base href="http://www.bankasya.com.tr/">
You will also need to change your sources to absolute paths.
But use DOMDocument
I'm using SimpleXML to bring in a data feed, and now I want to put that into working variables to use in my PHP doc.
Following the php.net guides on SimpleXML, I've arrived at:
<?php
$xml = simplexml_load_file('f1_feed.xml');
$xml = new SimpleXMLElement($xmlstr);
echo $xml->response->williamhill->class->type->market[0]->name;
?>
but I keep getting a blank page. Have I completely missed the point of how to parse the XML and put it into a working var?
(The feed is local for development.)
You don't need both new SimpleXMLElement and simplexml_load_file:
simplexml_load_file Returns an object of class SimpleXMLElement
SimpleXMLElement Returns a SimpleXMLElement object
try:
if (file_exists('f1_feed.xml')) {
    $xml = simplexml_load_file('f1_feed.xml');
    print_r($xml);
} else {
    exit('Failed to open f1_feed.xml.');
}
or:
if (file_exists('f1_feed.xml')) {
    $xml = new SimpleXMLElement(file_get_contents('f1_feed.xml'));
    echo $xml->response->williamhill->class->type->market[0]->name;
} else {
    exit('Failed to open f1_feed.xml.');
}
If it still doesn't work, add
error_reporting(E_ALL);
ini_set("display_errors", 1);
for better error reporting.
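One more thing worth knowing about blank pages: on a parse failure simplexml_load_file returns false and the warnings may be hidden. A sketch that surfaces libxml's own messages (keeping the element path from your code):
libxml_use_internal_errors(true); // collect parse errors instead of printing warnings
$xml = simplexml_load_file('f1_feed.xml');
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
    }
    libxml_clear_errors();
    exit;
}
echo $xml->response->williamhill->class->type->market[0]->name;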
The problem is only happening with one file when I try a DOMDocument/SimpleXML approach, so it seems like the issue is with that file. No clue what it could be.
If I do the following:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
print_r($xml);
in Chrome, I get a "Page Unavailable" error. In Firefox, I get nothing.
If I do the same thing but to a "test2.html", I get a print out as expected.
If I try the same thing but doing it this way:
$file = "test1.html";
$data = file_get_contents($file);
$dom = DOMDocument::loadHTML($data);
$xml = simplexml_import_dom($dom);
print_r($xml);
I get the same issue.
If I comment out the print_r line, Chrome goes from "Page Unavailable" to a blank page.
I changed the permissions to 777, in case that was an issue; no fix.
I tried simply echoing out the contents of the HTML file; no problem at all.
Any clues as to why a) Chrome would do that, and b) why I'm not getting any usable results?
Update:
If I put in:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
if(!$dom) {
echo "No Load!";
}
else {
$xml = simplexml_import_dom($dom);
print_r($xml);
}
I get the same issue. If I put in:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
if(!$dom) {
echo "No Load!";
}
else {
echo "Load!";
}
I get the "Load!" output, meaning that the dom method shouldn't be the problem (?)
I'll try the same exact test with the simplexml.
Update2:
If I do this:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
if(!$xml) {
echo "No Load!";
}
else {
echo "Load!";
}
I get "Load!" but if I do:
$file = "test1.html";
$dom = DOMDocument::loadHTMLFile($file);
$xml = simplexml_import_dom($dom);
if(!$xml) {
echo "No Load!";
}
else {
echo "Load!";
print_r($xml);
}
I get the error. I did finally notice that I had an option to view the error in Chrome:
Error 324 (net::ERR_EMPTY_RESPONSE): Unknown error.
The troublesome HTML file is 288 KB. Could that be the issue? If so, how would I adjust for that?
Last Update:
Very odd. I can use methods and functions on the object (as SimpleXML or DOMDocument), so I can do things like use XPath to delete or parse the HTML, etc. In some cases (small results) it can echo out results, but for big stuff (showing all spans) it fails in the same way.
So, since I think the end result will fit within these parameters, I SHOULD be okay (I guess).
But any real solution is very welcome.
Turn on error reporting: error_reporting(E_ALL); in the first line of your PHP code.
Check the memory limit of your PHP configuration: memory_limit in the respective php.ini
What's the difference between test1.html and test2.html? Perhaps test1.html is not well-formed.
DOMDocument and/or SimpleXML may bail out if the document is malformed. Try something like:
$dom = DOMDocument::loadHTMLFile($file);
if (!$dom) {
    echo 'Loading file failed';
    exit;
}
$xml = simplexml_import_dom($dom);
if (!$xml) {
    ...
}
If creating the $dom worked, conversion to $xml should work as well, but make sure anyway.
Edit: As Gehrig said, make sure error reporting is on, that should make it obvious where the process fails.
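If the memory limit does turn out to be the culprit (print_r on a large SimpleXML tree produces a very large response), a quick test is to raise it for the request; the 256M below is just an arbitrary test value:
ini_set('memory_limit', '256M'); // raise the limit for this request only
error_reporting(E_ALL);          // surface any warnings while testing
ini_set('display_errors', 1);

$dom = DOMDocument::loadHTMLFile('test1.html');
$xml = simplexml_import_dom($dom);
print_r($xml);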