I was wondering how I should go about writing an XML data layer for a fairly simple PHP web site. The reasons for this are:
A DB server is not available.
The data schema is simple and can be expressed in XML.
I like the idea of having a self-contained app, without server dependencies.
I would possibly want to abstract it to a small framework for reuse in other projects.
The schema resembles a simple book catalog with a few lookup tables plus i18n. So, it is quite simple to express.
The size of the main XML file is in the range of 100 KB to 15 MB, but it could grow at some point to ~100 MB.
I am actually considering extending my model classes to handle XML data.
Currently I fetch data with a combination of XMLReader and SimpleXML, like this:
public function find($xpath){
    while($this->xml_reader->read()){
        if($this->xml_reader->nodeType === XMLReader::ELEMENT
            && $this->xml_reader->localName == 'book'){
            // expand only the current <book> node into its own DOM tree
            $node = $this->xml_reader->expand();
            $dom = new DOMDocument();
            $n = $dom->importNode($node, true);
            $dom->appendChild($n);
            $sx = simplexml_import_dom($n);
            // xpath() returns an array of matches
            $res = $sx->xpath($xpath);
            if(isset($res[0]) && $res[0]){
                $this->results[] = $res;
            }
        }
    }
    return $this->results;
}
So, instead of loading the whole XML file into memory, I create a SimpleXML object for each section and run an XPath query on that object. The function returns an array of SimpleXML result sets. For a conservative search I would probably break on the first item found.
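Roughly, the "break on first match" variant I have in mind would look like this (findFirst is just a placeholder name, reusing the same $xml_reader property):

public function findFirst($xpath){
    while($this->xml_reader->read()){
        if($this->xml_reader->nodeType === XMLReader::ELEMENT
            && $this->xml_reader->localName == 'book'){
            $dom = new DOMDocument();
            $dom->appendChild($dom->importNode($this->xml_reader->expand(), true));
            $res = simplexml_import_dom($dom)->xpath($xpath);
            if(isset($res[0]) && $res[0]){
                return $res[0]; // stop streaming as soon as one book matches
            }
        }
    }
    return null; // nothing matched
}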
The questions I have to ask are:
Would you consider this as a viable solution, even for a medium to large data store?
Are there any considerations/patterns to keep in mind, when handling XML in PHP?
Does the above code scale for large files (100mb)?
Can inserts and updates in large xml files be handled in a low overhead manner?
Would you suggest an alternative data format as a better option?
If you have a saw and you need to pound in a nail, don't use the saw. Get a hammer. (folk saying)
In other words, if you want a data store, use a data-base, not a markup language.
PHP has good support for various database systems via PDO; for small data sets, you can use SQLite, which doesn't need a server (it is stored in a normal file). Later, should you need to switch to a full-featured database, it is quite simple.
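For illustration, a minimal sketch of SQLite through PDO (the file name and the books table are made up for this example, not part of any existing schema):

// open (or create) a single-file SQLite database; no server involved
$db = new PDO('sqlite:catalog.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE IF NOT EXISTS books (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    lang  TEXT NOT NULL
)');

// prepared statements work the same way they would against MySQL or PostgreSQL
$stmt = $db->prepare('SELECT title FROM books WHERE lang = ?');
$stmt->execute(array('en'));
foreach ($stmt as $row) {
    echo $row['title'], "\n";
}

Switching to a full-featured database later is then mostly a matter of changing the DSN.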
To answer your questions:
Viable solution - no, definitely not. XML has its purposes, but simulating a database is not one, not even for a small data set.
With XML, you're shuffling strings around, all the time. That might be just bearable on read, but is a real nightmare on write (slow to parse, large memory footprint, etc.). While you could subvert XML to work as a data store, it is simply the wrong tool for the job.
No (everything will take forever, if you don't run out of memory before that).
No, for many reasons (locking, re-writing the whole XML-string/file, not to mention memory again).
5a. SQLite was designed with very small and simple databases in mind - simple, no server dependencies (the db is contained in one file). As #Robert Gould points out in a comment, it doesn't scale for larger applications, but then
5b. for a medium to large data store, consider a relational database (and it is usually easier to switch databases than to switch from XML to a database).
No, it won't scale. It's not feasible.
You'd be better off using e.g. SQLite. You don't need a server, it's bundled in with PHP by default and stores data in regular files.
I would go with SQLite instead, which is perfect for small websites and x-copy style deployments.
XML-based data storage won't scale well.
"SQLite is an ACID-compliant embedded relational database management system contained in a relatively small (~225 kB) C programming library. The source code for SQLite is in the public domain.
Unlike client-server database management systems, the SQLite engine is not a standalone process with which the program communicates. Instead, the SQLite library is linked in and thus becomes an integral part of the program. It can also be called dynamically. The program uses SQLite's functionality through simple function calls, which reduces latency in database access as function calls within a single process are more efficient than inter-process communication. The entire database (definitions, tables, indices, and the data itself) is stored as a single cross-platform file on a host machine. This simple design is achieved by locking the entire database file at the beginning of a transaction."
Everyone loves to throw dirt on XML files, but in reality they work. I've seen large applications use them, and I know of an MMO that uses simple flat files for storage and it works fine (by the way, the MMO is among the top 5 worldwide, so it's not just a toy). However, my job right now is creating a better and more savvy persistence layer based on SQL, and if your site will be big, SQL is the best solution, but XML is capable of massive (MMO) scalability if done well.
But a caveat: migration from XML to SQL is rough if the mapping isn't easy.
Related
If I wanted to do large amounts of data fitting using matrices that were too large to fit in memory, what tools/libraries would I look into? Specifically, if I was running on data from a website normally using PHP+MySQL, how would you suggest making an offline process that could run large matrix operations in a reasonable amount of time?
Possible answers might be like "you should use this language with these distributed matrix algorithm to map reduce on many machines". I imagine that php isn't the best language for this so the flow would be more like some other offline process reads the data from the database, does the learning, and stores back the rules in a format that php can make use of later (since the other parts of the site are built in php).
Not sure if this is the right place to ask this one (would have asked it in the machine learning SE but it never made it out of beta).
There are lots of things that you need to do if you want to process large amounts of data.
One way of processing web-scale data is to use Map/Reduce, and maybe you can look at Apache Mahout, which is a scalable machine learning package containing:
Collaborative Filtering
User and Item based recommenders
K-Means, Fuzzy K-Means clustering
And many more.
Specifically, what you want to do might be available in some open-source project, such as Weka, but you might need to migrate/create code to run it as a distributed job.
Hope the above gives you an idea.
Machine learning is a wide field and can be used for many different things (for instance supervised predictive modelling and unsupervised data exploration). Depending on what you want to achieve and on the nature and dimensions of your data, finding algorithms that are interesting both in terms of the quality of the model they output and in terms of scalability (the ability to leverage large training sets, plus speed and memory consumption at prediction time) is a hard problem that cannot be answered in general. Some algorithms are scalable because they are online (i.e. they learn incrementally without having to load the whole dataset at once); others are scalable because they can be divided into subtasks that can be executed in parallel. It all depends on what you are trying to achieve and on which kind of data you have collected / annotated in the past.
For instance, for text classification, simple linear models like logistic regression with good features (TF-IDF normalization, optionally bi-grams, optionally chi2 feature selection) can scale to very large datasets (millions of documents) without any need for parallelization on a cluster. Have a look at liblinear and vowpal wabbit for building such scalable classification models.
This is a long and old question that doesn't get to the point. I basically wanted to know about practices involving flat files and the extent to which they could be used as a replacement for SQL, mostly in terms of multi-user capability.
At the time, I was wanting to replicate a SQL table editor interface with flat files, allowing collaborative editing. Basically like a multiuser Excel, with an automated data-entry interface, and interactive sortable tables.
I also wanted to build a CMS index page for a server, which parsed text files in order to construct a dynamic webpage, which allowed for easy updating/managing.
I've begun learning MySQL and XML. For dynamic data storage, I prefer XML over MySQL because it doesn't require a server and can be edited within a text editor, but I'm unsure whether they can be used for similar things.
(I know that MySQL and XML are two completely different things, but I'm looking at this in regards to data storage.)
In the past I've manually stored lists of stuff in *.txt files (to keep track of things), sometimes with multiple fields per-row kinda thing, like a table, or lines with related data. HTML tables are good for this, but it would be even nicer to be able to edit directly in the page without the need for a text editor, and in certain situations, allow multiple persons (collaborative editing) to edit different sections at the same time.
(I want to use PHP to create scripts that can do this - allow editing of files in browser, including collaborative. I want to learn data manipulation methods in general.)
So basically, I want to create an index for whatever scripts and documents I'd want to display, in the form of a Content Management System. I'd want pages to be modular somehow.. Some modules would be a CRUD (create, read, update, delete) with tabular data, another module could be a pastebin-like text dump derived from a PHP script, some sort of article-publishing system for wiki-like linked articles, single articles or blog posts.
Anyway, I've made scripts that parse XML files, and I like the idea of separating content from presentation, but I don't know how/if XML could be incorporated into a CMS (or any dynamically-editable situation), as most popular ones use MySQL. This is only for personal use and not for some big site, and it would be nice for it to be simple and portable, only requiring the Web server. I'd only prefer MySQL as a last resort, as I don't like having to setup MySQL every time I switch servers, or going through MySQL connection errors.
What should I do / Any suggestions?
I prefer XML over MySQL because it doesn't require a server
I prefer to travel on foot rather than on wheels because it doesn't require a car. So, I spend 6 hours getting to my job and back every day.
XML can be edited within a text-editor
In theory.
In practice, XML is bound by so many strict rules and standards that you can scarcely edit a comma without breaking the whole file.
Face the truth - it is for programs, not humans.
In the past I've manually stored lists of stuff in *.txt files
You'd better stick with this approach going forward.
HTML tables are good for this,
HTML tables are worse for this, even worse than XML.
I want to create an index for whatever scripts and documents I'd want to display, in the form of a Content Management System.
You are taking Content Management Systems wrong: it's a Content Management System, not a script management system. It merely manages the content, i.e. the data stored somewhere.
I like the idea of separating content from presentation,
I like it too, but your XML has nothing to do with this idea.
What should I do / Any suggestions?
Learn to drive a car. Do not remain a pedestrian.
Learn databases.
An interesting read: http://www.joelonsoftware.com/articles/fog0000000319.html
So you have two questions: "Can XML be used like MySQL?" (No, it must be treated differently; use XPath instead of SQL, etc.) and "Can XML be used for building a CMS?" (Yes, there are some like that, e.g. GetSimple CMS - see http://get-simple.info/start/)
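To give a rough feel for the difference (books.xml, the <book> elements and their <title> children are invented just for this example), a query that in SQL would be SELECT * FROM books WHERE lang = 'en' looks roughly like this with SimpleXML:

// hypothetical books.xml containing <book lang="..."><title>...</title></book> elements
$xml = simplexml_load_file('books.xml');

// the XPath expression takes the place of the SQL WHERE clause
$english = $xml->xpath("//book[@lang='en']");

foreach ($english as $book) {
    echo $book->title, "\n";
}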
You must be aware that XML is suitable only for smaller amounts of data, but in that case you probably don't need the weight of a database.
Update
I have looked into various PHP frameworks (the Yii framework looks particularly good/interesting) and they seem to offer some very good features compared to simple template engines (including MVC and other features like database integration). I am certain that some form of separation between code and display is needed. Whether that is just plain PHP (as a template engine), a template engine (e.g. Smarty) or a framework probably depends a lot on the application and on the company's/programmer's choice (an issue I can continue to research in my own time, and will probably not get definite answers for).
I still have one remaining question. Is it possible to have a multi-tiered setup (as in multiple separate servers) where one tier runs PHP code ("application logic") whose output, in a form such as XML, JSON or any other data interchange format, gets sent to the web/HTML tier, which converts it into HTML, so that the total average number of pages served per second is higher than with just a single tier (even if that single tier got all the servers from the two separate tiers combined for itself)? I'm guessing the parsing time for XML (and probably JSON) would eat into this, but a new protocol between the two tiers could be used that is optimized for this purpose.
I was pondering HTML and code separation (both to implement MVC and allow web designers (as in looks/views) and web developers (as in code/controllers) to work independently) and I thought it should be possible to have application servers that run PHP (Application/Business Logic/Controller) and web servers that take the output from the application servers and insert it into HTML markup (Looks/Views).
In theory it would work a bit like the separation of an application server and a database server: while a single request might be slightly slower for that one user due to network overhead, you could handle considerably more simultaneous requests with two small servers than with one big server. For example, the application server could send its processed (view-independent) information to the web server, which would then insert it into the HTML (which could be different depending on the client, e.g. mobile browsers). It may be possible to cache the HTML in RAM but not the dynamic content, so that even if the page looks fairly different for every user (e.g. Facebook), some speed benefit is still gained compared to a single HTML/PHP combo. It would also, of course, separate the HTML from the application code (PHP).
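As a very rough sketch of what I mean (the internal URL and field names are invented just for illustration), the application tier would emit plain data and the web tier would wrap it in markup:

// application tier (e.g. api.example.internal/profile.php): view-independent output
header('Content-Type: application/json');
echo json_encode(array('name' => 'Alice', 'unread' => 3));

// web/HTML tier: fetch the data from the application tier and render it
$data = json_decode(file_get_contents('http://api.example.internal/profile.php'), true);
echo '<h1>Hello, ' . htmlspecialchars($data['name']) . '</h1>';
echo '<p>You have ' . (int) $data['unread'] . ' unread messages.</p>';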
I looked into this and I came across numerous PHP template engines which would facilitate the separation of HTML from code; however, some of them are considerably slower than using just PHP, and "compiling" the template doesn't seem to make that much of a difference (and would prevent using separate web/HTML and code/PHP servers). While using PHP itself as a template engine might work OK for me as a single developer, it will not work in an environment with separate web designers.
So, to summarize, I am looking for (or to create) a combination of an MVC framework and a template engine/system that facilitates HTML/code separation and view vs. model/controller separation, and that will actually be faster than just using a single tier. Or, more accurately, one that has better scalability than a single tier (although I wouldn't expect much of a speed decrease, if any, for single pages).
Have you looked into one of the gazillion PHP frameworks around? Most of them sport all of this in one form or another. Heck, I've written a few myself, although with a strict XML templating engine to get away from greedy PHP developers. :) (xSiteable if you want to peek at the code)
MVC is nice and all, but there are many ways of skinning this cat, and to be frank, MVC (or any of its many incarnations, mutations and variations) gives separation at the cost of complexity, so you'll certainly not make it faster per se. The only thing I can think of is that your template engine spits out (i.e. writes to disk in a cached fashion) pure PHP files based on backend business logic, and then you can put various accelerators to good use. You still have to decide on what the templating environment should be, though. HTML with interspersed notation or PHP, XML, or something else?
A compiler is easy enough to make, but I'm a bit wary. I doubt you'll get much speed improvements (at least not compared to the added complexity) that way over a well cached templating engine, but it's certainly doable.
But why not just use PHP itself, and simple Apache rewrite rules (using 'uri' as parameter)?
$theme = 'default' ;
$dbase = 'mysql' ;
$logic = $_REQUEST['uri'] ; // or some other method, like __this__ with starting folder snipped
include 'themes/top.html' ;
include 'logic/'.$logic.'/header.php' ;
include 'themes/mid.html' ;
include 'logic/'.$logic.'/menu.php' ;
include 'themes/section1.html' ;
include 'logic/'.$logic.'/section1.php' ;
include 'themes/section2.html' ;
include 'logic/'.$logic.'/section2.php' ;
include 'themes/section3.html' ;
include 'logic/'.$logic.'/section3.php' ;
include 'themes/bottom.html' ;
include 'logic/'.$logic.'/footer.php' ;
include 'themes/end.html' ;
It's brute, fast, and does provide what you want, although it's not elegant nor pretty nor, uh, recommended. :)
I'm trying to develop a website in which many recipes are stored and retrieved for the clients. I took some courses about XML and native XML-based databases, and those courses introduced the concept of native XML databases. Besides, if I remember correctly, we learned that XQuery is the most suitable programming language for working with XML. Because of the semi-structured and not-so-tabular nature of a recipe, I guess (please correct me if I'm wrong) that it can best be expressed in an XML file, like below:
<recipe>
  <ingredients>
    <ingredient name='flour' amount='500g'/>
    <ingredient name='y' amount='200g'/>
  </ingredients>
  <steps>
    <step id='1'>first prepare .....</step>
  </steps>
</recipe>
I know that relational databases have their advantages and glories over other options; however, they would result in many join operations on tables in this particular case. On the other hand, native XML databases don't seem very promising to me regarding their performance and ability to handle a large amount of data. Besides, programming in PHP is much simpler than in XQuery, considering the huge volume of tutorials and help on the internet.
I really don't know what to do, and that's why I came to you guys.
Some simple decision criteria, without looking at any strong requirements:
First, where is your data source going to be?
If your data is generated through user input screens,
if your data is well validated and processed by a single application (e.g. the web app),
if your data's properties and features are pretty much frozen, with no new dimensions to it,
and if your data is transactional in nature,
then you can think of a relational DB.
If your data comes from different data sources like flat files, XML, screen scraping of web pages, etc.,
if there is a comparatively small amount of transactions,
if the data's properties are fluid and can have various slices/dimensions to them,
and if you are ready to work with functional languages like XQuery or XML-ized languages like XSLT,
then an XML database is the key.
Use a relational DB, because it is much faster once you get a bigger amount of records, and it is simpler to create.
(For your example that is 3 tables: one with recipes, another with ingredients and the last one with steps. An alternative is to create a table with all known ingredients and use an association, e.g. a table with the ID of the recipe, the ID of the ingredient and the amount.)
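A rough sketch of that layout (using SQLite through PDO only so the snippet is self-contained; table and column names are illustrative, not prescriptive):

$db = new PDO('sqlite:recipes.sqlite');
$db->exec('CREATE TABLE recipes (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
)');
$db->exec('CREATE TABLE ingredients (
    recipe_id INTEGER REFERENCES recipes(id),
    name      TEXT NOT NULL,
    amount    TEXT NOT NULL
)');
$db->exec('CREATE TABLE steps (
    recipe_id INTEGER REFERENCES recipes(id),
    position  INTEGER NOT NULL,
    text      TEXT NOT NULL
)');

// fetching one recipe is then a couple of simple queries (or one join)
$steps = $db->prepare('SELECT text FROM steps WHERE recipe_id = ? ORDER BY position');
$steps->execute(array(1));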
It seems that you're thinking that you have to choose one or the other here. That isn't the case: XQuery isn't really set up to be a complete web scripting environment; it's a replacement for SQL, not for PHP. Therefore you can certainly use PHP to do the web-focused parts of the site, such as user logins (which could also be in a relational DB), then use XQuery just for your recipe querying layer.
Some XML databases such as MarkLogic can also do all the web logic side of the equation but they don't offer the same richness of libraries yet, so I would certainly recommend PHP or something like that for the web tier.
I have a custom PHP framework and am discovering the joys of storing configuration settings as human-readable, simple XML.
I am very tempted to introduce XML (really simple XML, like this)
<mainmenu>
<button action="back" label="Back"/>
<button action="new" label="New item"/>
</mainmenu>
in a number of places in the framework. It is just so much more fun to read, maintain and extend. Until now, configuration was done using
name=value;name=value;name=value;
pairs, arrays, collections of variables, and so on.
However, I fear for the framework's performance. While the individual XML operations amount to next to nothing in profiling, they are of course more expensive than a simple explode() on a big string. I feel uncomfortable with SimpleXML (my library of choice) doing a full well-formedness check on a dozen XML chunks every time the user loads a page.
I could cache the XML objects myself but would like to 1.) avoid the hassle and 2.) not add to the framework's complexity.
I am therefore looking for an unobtrusive XML "caching" solution. The perfect thing would be a function that I can give a file path and that returns a parsed SimpleXML object. Internally, it maintains a cache somewhere with serialized SimpleXML objects or whatever.
Does anybody know tools to do this?
No extension-dependent solutions please, as the framework is designed to run in shared webhosting environments (which is the reason why performance matters).
You could transform the XML into your former format once and then check the modification times of the XML and the text file via filemtime(). If the XML is newer than the text file, do the transformation again.
This would increase complexity in a way, but on the other hand would help you reuse your existing code. Of course, caching is another viable option.
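A minimal sketch of that idea (function and file names are made up; the parsed XML is dumped once as a plain PHP array and rebuilt only when the XML file is newer):

function load_config($xml_file, $cache_file)
{
    if (!file_exists($cache_file) || filemtime($xml_file) > filemtime($cache_file)) {
        // XML changed (or no cache yet): parse it and regenerate the cache file
        $sx   = simplexml_load_file($xml_file);
        $data = json_decode(json_encode($sx), true); // crude SimpleXML -> array conversion
        file_put_contents($cache_file, '<?php return ' . var_export($data, true) . ';');
        return $data;
    }
    return include $cache_file; // cheap: no XML parsing, and opcode caches can help
}

$menu = load_config('config/mainmenu.xml', 'cache/mainmenu.php');

Note that this hands back a plain array rather than a SimpleXML object, since SimpleXML objects cannot be serialized and cached directly.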
Hi there, I'm using Zend Cache for those kinds of things and must say it's very fast; it got one page down from about 2 secs to 0.5 secs.