Sanitize an external XML file with PHP

Here's my problem:
function parse_xml_to_json($url) {
    $fileContents = file_get_contents($url);
    $simpleXml = simplexml_load_string($fileContents, null, LIBXML_NOCDATA);
    $json = json_encode($simpleXml);
    return $json;
}
$jsonto = parse_xml_to_json('myxmlfile.html');
echo $jsonto;
Essentially I need to use an XML file from an external source and loop through it to display some data nicely.
I created a function that fetches the content from the external URL (file_get_contents), then turns the XML string into an object (I pass LIBXML_NOCDATA as a parameter because the file contains CDATA sections), then turns the object into JSON, and as the very last step echoes the result.
So far so good, it works, but I'm wondering whether I should do anything more in case the XML file contains a malicious script or the like.
Are simplexml_load_string and then json_encode enough to protect against a malicious script or invalid XML?

Your code is prone to a Denial of Service (DoS) attack.
$fileContents = file_get_contents($url);
This can blow your memory limit, or come close to it while taking a long time (the server you request the data from stalls in the middle after providing a lot of content, and then sends only a few bytes every couple of seconds). Your script will then "hang" while consuming the memory.
If the script can then be triggered with another HTTP request multiple times, this can consume your servers resources (the echo statement suggests this is entirely possible).
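One mitigation is to cap both the download size and the wait time before parsing anything. A minimal sketch, assuming illustrative limits (the 5-second timeout and 2 MB cap are not from the original answer):
$context = stream_context_create(array(
    'http' => array('timeout' => 5),
));

// The fifth argument caps how many bytes we are willing to read.
$fileContents = file_get_contents($url, false, $context, 0, 2 * 1024 * 1024);

if ($fileContents === false) {
    // The request failed or timed out; bail out instead of hanging.
    exit('Could not fetch the XML feed.');
}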

Related

Parallel downloads using simplexml_load_file in PHP

I want to use PHP to simultaneously download data from 2 URLs via simplexml_load_file, but the script must wait until all the data is gathered before going ahead with the rest of the code.
$url1 = "http://www.example.com/api1";
$request1 = simplexml_load_file($url1);
$url2 = 'http://www.example.com/api2';
$request2 = simplexml_load_file("compress.zlib://$url2", NULL, TRUE);
echo 'finished';
I want all the data to be completely downloaded before printing the word finished.
How would you edit the script above to accomplish that?
Fetching URLs directly while opening "files" with functions such as simplexml_load_file is intended as a short-cut for simple cases where you don't need things like non-blocking / asynchronous I/O.
Your script as written will wait for everything to download before printing the word "finished", but it will also wait for the response from http://www.example.com/api1 to finish downloading before starting the request to http://www.example.com/api2.
You will need to break your problem down:
1. Download the contents of the two URLs in parallel (or, more accurately, "asynchronously"). Your result will be two strings.
2. Parse each of those strings using simplexml_load_string.
The most popular HTTP library for PHP is Guzzle, but you should be able to find many alternatives, and guides to writing your own using the built-in cURL functions if you search for terms like "PHP asynchronous HTTP" or "PHP parallel HTTP requests".
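For illustration, here is a minimal sketch of the built-in curl_multi route mentioned above (the URLs are the example endpoints from the question):
$urls = array(
    'http://www.example.com/api1',
    'http://www.example.com/api2',
);

$multi   = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, '');  // accept gzip/deflate responses
    curl_multi_add_handle($multi, $ch);
    $handles[$url] = $ch;
}

// Drive both transfers until every one of them has finished.
do {
    $status = curl_multi_exec($multi, $active);
    if ($active) {
        curl_multi_select($multi);
    }
} while ($active && $status === CURLM_OK);

$results = array();
foreach ($handles as $url => $ch) {
    // Parse each response only after all downloads have completed.
    $results[$url] = simplexml_load_string(curl_multi_getcontent($ch));
    curl_multi_remove_handle($multi, $ch);
    curl_close($ch);
}
curl_multi_close($multi);

echo 'finished';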

Efficient handling of IO in PHP

I'm using PHP and handling file operations. My server needs to respond to every client request (a minimum of 5000 concurrent clients). It opens an XML file, converts the XML to a PHP array, does some calculations, and responds with a JSON file. For this I'm using the code below:
$xmlstring = file_get_contents("../api/rate.xml");
$xml = simplexml_load_string($xmlstring);
$rate_json = json_encode($xml);
$rate_array = json_decode($rate_json, true);
$return_rates = array();
$return_rates['Baserates'] = $rate_array['Baserates'];
/* Here I will do some processing and create the $return_rates array */
$return_rates = json_encode($return_rates);
echo $return_rates;
This code produces the result I need, but I sometimes get a 500 internal server error because of an I/O handling issue (my server people say this is the cause of the error). The issue happens when there are many concurrent reads of the file. Please, can anyone help me solve this?
The XML file is produced by a third-party application; I receive a new copy of it every second.
You need to execute this code only when rate.xml has changed. First, check whether the content of rate.xml has changed by computing a hash of it. If the hash differs from the last hash you stored, execute the following code, store the JSON file in local storage, and save the new hash.
$xmlstring = file_get_contents("../api/rate.xml");
$xml = simplexml_load_string($xmlstring);
$rate_json = json_encode($xml);
Otherwise you just need to read the local JSON file and decode it. Or you can simply read from $xml without calling json_encode. Or you can ask the third-party app's developer to provide JSON output instead of XML.
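A minimal sketch of that hash-based cache, assuming illustrative file paths and md5_file() as the change check:
$xmlPath   = '../api/rate.xml';
$cachePath = '../cache/rate.json';  // hypothetical cache location
$hashPath  = '../cache/rate.hash';

$currentHash = md5_file($xmlPath);
$lastHash    = is_file($hashPath) ? file_get_contents($hashPath) : '';

if ($currentHash !== $lastHash) {
    // The XML changed: re-parse it and refresh the cached JSON.
    $xml = simplexml_load_string(file_get_contents($xmlPath));
    file_put_contents($cachePath, json_encode($xml), LOCK_EX);
    file_put_contents($hashPath, $currentHash, LOCK_EX);
}

// Every request after that only pays for one small JSON read.
$rate_array = json_decode(file_get_contents($cachePath), true);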
Also, instead of file_get_contents(), you can use cURL and send an HTTP HEAD request first to find out whether the size of rate.xml has changed, instead of reading its contents directly. Only when it has changed do you issue an HTTP GET to retrieve the content.
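A minimal sketch of that HEAD-first check (the URL and the remembered $lastLength are illustrative assumptions):
$url = 'http://thirdparty.example.com/rate.xml';  // hypothetical source URL
$lastLength = 12345;  // the size remembered from the previous poll

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);          // send a HEAD request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
$length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);

if ($length != $lastLength) {
    // The size changed: issue a normal GET and rebuild the cache.
    $xmlstring = file_get_contents($url);
}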
The goal is to minimize network I/O and file I/O. Try to cache rate.xml as long as possible in local storage or RAM (try Redis or Memcached).

Status report during form process

I created a little script that imports wordpress posts from an xml file:
if(isset($_POST['wiki_import_posted'])) {
    // Get uploaded file
    $file = file_get_contents($_FILES['xml']['tmp_name']);
    $file = str_replace('&', '&', $file);

    // Get and parse XML
    $data = new SimpleXMLElement($file, LIBXML_NOCDATA);
    foreach($data->RECORD as $key => $item) {
        // Build post array
        $post = array(
            'post_title' => $item->title,
            ........
        );

        // Insert new post
        $id = wp_insert_post($post);
    }
}
The problem is that my XML file is really big, and when I submit the form, the browser just hangs for a couple of minutes.
Is it possible to display some messages during the import, like displaying a dot after every item is imported?
Unfortunately, no, not easily. Especially if you're building this on top of the WP framework you'll find it not worth your while at all. When you're interacting with a PHP script you are sending a request and awaiting a response. However long it takes that PHP script to finish processing and start sending output is how long it usually takes the client to start seeing a response.
There are a few things to consider if what you want is for output to start showing as soon as possible (i.e. as soon as the first echo or output statement is reached).
Turn off output buffering so that output begins sending immediately.
Inside the loop, output whatever would indicate the progress you wish to know about.
Note that if you're doing this with an AJAX request content may not be ready immediately to transport to the DOM via your XMLHttpRequest object. Also note that some browsers do their own buffering before content can be ready for the user to display (like IE for example).
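A minimal sketch of those two points applied to the import loop (illustrative only; as noted, some browsers and proxies still buffer on their own):
// Disable all active output buffers, then flush after every echo.
while (ob_get_level() > 0) {
    ob_end_flush();
}
ob_implicit_flush(true);

foreach ($data->RECORD as $item) {
    $post = array('post_title' => (string) $item->title /* , ... */);
    wp_insert_post($post);
    echo '.';   // one dot per imported item
    flush();    // push it to the client right away
}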
Some suggestions you may want to look into to speed up your script, however:
Why are you doing str_replace('&', '&', $file) on a large file? You realize that has cost with no benefit, right? You've accomplished nothing, and if you meant to replace the HTML entity &amp; then you probably have some of your logic very wrong. Encoding is something you want to let the XML parser handle.
You can use curl_multi instead of file_get_contents to make multiple HTTP requests concurrently and save time if you are transferring a lot of files. It will be much faster since it's non-blocking I/O.
You should use DOMDocument instead of SimpleXML and a DOMXPath query can get you your array much faster than what you're currently doing. It's a much nicer interface than SimpleXML and I always recommend it above SimpleXML since in most cases SimpleXML makes things incredibly difficult to do and for no good reason. Don't let the name fool you.
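For illustration, a minimal DOMDocument/DOMXPath sketch of the same loop, assuming the RECORD/title structure from the question:
$doc = new DOMDocument();
$doc->loadXML($file, LIBXML_NOCDATA);

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//RECORD') as $record) {
    // evaluate() with a context node extracts each field as a plain string
    $title = $xpath->evaluate('string(title)', $record);
    // ... build the $post array and call wp_insert_post() as before ...
}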

Comparing XML documents for changes in PHP

Currently I'm using PHP to load multiple XML files from around the web (non-local) using simplexml_load_file(). This, as you can imagine, is quite a clunky process and is slowing load time significantly (7 seconds to load 7 files), and there could possibly be more files to load. These files don't change often, but changes should be displayed on the page as soon as they are made.
One idea I had was to cache a version of each feed and the html output I generate from that feed in my DB. Then, each time the user loads the page, the feeds would be compared; if they are different I would run my existing code, generate the HTML, output it, and save it to the DB. However, if it is the same, I could simply output the cached HTML.
My two concerns with this are:
Security: If I am storing a copy of an XML file, could this pose a security threat, seeing as I don't control the content of that file?
Speed: The main goal here is to increase the speed of the overall page load. Would the process described above increase the speed, or would it just bog down the server with more to do? Thanks for your help!
How about having a cron job crawl through every external XML source, say, hourly or quarter-hourly and update it if necessary?
It wouldn't be in 100% real time, but would take the load off your web page - that would always be using cached files. I don't think there is a reliable way of polling external sources for updates other than actually downloading the file (in theory, it should be possible to get the correct cache headers, but I wouldn't rely on them being configured correctly.)
Security: If I am storing a copy of an XML file, could this pose a security threat, seeing as I don't control the content of that file?
Hardly. To make totally sure, store the cached XML files outside the web root. Any threat that remains is then the same as if you were passing the stream through live.
One idea I had was to cache a version of each feed and the html output I generate from that feed in my DB. Then, each time the user loads the page, the feeds would be compared; if they are different I would run my existing code, generate the HTML, output it, and save it to the DB. However, if it is the same, I could simply output the cached HTML.
Rather than caching the XML file yourself, you should set the If-None-Match or If-Modified-Since fields in the request header. This way you can check to see if the files have changed without necessarily downloading them.
This can be done by setting a stream context for libxml before running simplexml_load_file(). If the file hasn't changed, you'll get a 304 Not Modified response, and simplexml_load_file will fail.
You could also use stream_context_get_default to set the general stream context, then retrieve the XML file into a string with file_get_contents and pass it to simplexml_load_string().
Here's an example of the first way:
class CachedXml {
    public $element, $url;
    private $mod_date, $etag;

    public function __construct($url) {
        $this->url = $url;
        $this->element = NULL;
        $this->mod_date = FALSE;
        $this->etag = FALSE;
    }

    public function updateXml() {
        if($this->mod_date || $this->etag) {
            $opts = array(
                'http' => array(
                    'header' => "If-Modified-Since: $this->mod_date\r\n" .
                                "If-None-Match: $this->etag\r\n"
                )
            );
            $context = stream_context_create($opts);
            libxml_set_streams_context($context);
        }
        if($attempt = @simplexml_load_file($this->url)) {
            $this->element = $attempt;
            $headers = get_headers($this->url, 1);
            $this->mod_date = $headers['Last-Modified'];
            $this->etag = $headers['ETag'];
            return TRUE;
        }
        return FALSE;
    }
}

$bob = new CachedXml('http://example.com/xml/test.xml');

if($bob->updateXml()) {
    echo "Bob was just updated.<br />";
    echo "Bob's name is " . $bob->element->getName() . ".<br />";
}
else {
    echo "Bob was not updated.<br />";
}

How to hide a JSON object from the source code of a page?

I am using a JSON object in my PHP file, but I don't want the JSON object to be displayed in the page's source code, as it increases my page size a lot.
This is what I'm doing in PHP:
$json = new Services_JSON();
$arr = array();
$qs = mysql_query("my own query");
while($obj = mysql_fetch_object($qs)) {
    $arr[] = $obj;
}
$total = sizeof($arr);
$jsn_obj = '{"abc":' . $json->encode($arr) . ',"totalrow":"' . $total . '"}';
and this is javascript
echo '<script language=\'javascript\'>
var dataref = new Object();
dataref = eval('.$jsn_obj.');
</script>';
But I want to hide this $jsn_obj object's value from my page source. How can I do that? Please help!
I'm not sure there's a way around your problem, other than to change your mind about whether it's a problem at all (it's not, really).
You can't use the JSON object in your page if you don't output it. The only other way to get the object would be to make a separate AJAX request for it. If you did it that way, you're still transferring the exact same number of bytes that you would have originally, but now you've added the overhead of an extra HTTP request (which will be larger than it would have been originally, since there are now HTTP headers on the transfer). This way would also be slower on your page load, since you'd have to load the page, then send the AJAX request and run the result.
There are much better ways to manage the size of your pages. JSON is just text, so you should look into a server-side solution to compress your content, like mod_deflate. mod_deflate works beautifully on dynamic PHP output as well as static pages. If you don't have control over your web server, you could use PHP's built-in zlib compression.
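A minimal sketch of that built-in route, assuming you can't touch the server config: ob_gzhandler compresses the whole response, inline JSON included.
// Must run before any output is sent; everything the script echoes
// afterwards is gzip-compressed for clients that accept it.
ob_start('ob_gzhandler');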
Instead of writing the JSON data directly into the document, you can use an XMLHttpRequest, or a library like jQuery, to load the JSON data at runtime.
It depends largely on your JSON data. If the data you're printing inline in the HTML is huge, you might want to consider using AJAX to load the JSON data, assuming you want your page to load faster even without the data.
If the data isn't that big, keep it inline and avoid the extra HTTP request. To speed up your page further, try a tool like YSlow to see what other areas you could optimize.
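If you do go the AJAX route, the server side is just a separate script that emits the same string; a minimal sketch (data.php is a hypothetical filename):
// data.php -- a hypothetical separate endpoint that emits only the JSON,
// so it never appears in the main page's source. Note that anyone can
// still request this URL directly, so it is not hidden in a security sense.

// ... run the same query/encode logic from the question here,
// producing $jsn_obj ...

header('Content-Type: application/json');
echo $jsn_obj;
The page then fetches it at runtime with XMLHttpRequest or jQuery's $.getJSON(), which also lets you drop the eval() call.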
