I am in the process of developing an html template generator. I would like to store the templates in a mysql database (or text files if that would be more efficient). I am looking for the best way to serialize the html and css, and then reproduce the original efficiently.
Javascript and php will be used to create/edit/and remove elements from both the browser and the database for later reproduction.
Elements can be added such as div, p, a, etc. and nesting should not be an issue. The main problem I am having is if a div is nested within a div and a paragraph element is nested within the second div at a later time, how would all of this be stored in a database? Elements will be deleted and new elements will be added in random order.
I hope this was somewhat clear. I am not really looking for code, but suggestions on how all of this would work together. Any help would be greatly appreciated. Thanks in advance
Store the entire thing as a blob in a single field in the database. Unless you're doing something quite novel, there's no reason the database layer needs to have an in-depth understanding of DOM structure.
I want to first point out that, what you need is a parser not a serialising method. You can simply store the template, then when you read it out, parse each individual element and then build the editing form. Sort of like xtgem does. No need to serialise. Unless you already parsed the data in an array.
If each user has his own template, then the chances of multiple reads is very low. Almost nill. Flat files will do very well in this siiuation. No need to sanitise and better performance. Do not forget file locking though.
If you are returning data in text input or textarea elements, no problem with xss. You don't need to sanitise output either.
I agree with #Rex. When I was developing a forum, I found that storing the original was better. It allowed for easy editing.
It would be best to separate the files and parse them separately. Keep presentation separate from markup.
Parsers have never been my forte however. So I can't help in that aspect.
You can take a look at pcltemplate from http://phpconcept.net to see how the parser works. Maybe give you ideas.
Related
I am pretty new to coding and am still learning the ropes. I got an XML feed that I'm processing where, amongst other common data that is easy to process, I can get complex elements such as :
<para>Text before the Link more text after the link.</para>.
I get the data out of XML by simplexml_load_string($xml), perhapes this is where I'm going wrong. What is the best way to process an XML file where there may or may not be 'inline child' elements so that the output looks as it should. I can get the data out but what I end up with is two entries, "Text before the Link more text after the link." and the link itself.
This seems simple and I'm sure others may have run into this but I'm not sure how to preserve the original markup. I have been trying to find an answer to this with the good old google but I must be not using the right terms. Any help or pointers to a resource would be much appreciated. I have gone through a few tutorials online, but I'm unsure what I'm doing wrong.
Thanks.
It seems that SimpleXml can work on simple xmls ;)
Here may be solution for you Use DOMDocument
I am working on an applet that allows the user to input a URL to a news article or other webpage (in Japanese) and view the contents of that page within an iFrame in my page. The idea is that once the content is loaded into the page, the user can highlight words using their cursor, which stores the selected text in an array (for translating/adding to a personal dictionary of terms) and surrounds the text in a red box (div) according to a stylesheet defined on my domain. To do this, I use cURL to retrieve the HTML of the external page and dump it into the source of the iFrame.
However, I keep running into major formatting problems with the retrieved HTML. The big problem is preserving style sheets, and to fix this, I've used DOMDocument to add tags to the section of the retrieved HTML. This works for some pages/URLs, but there are still lots of style problems with the output HTML for many others. For example, div layers crash into each other, alignments are off, and backgrounds are missing. This is made a bit more problematic as I need to embed the output HTML into a new in order to make the onClick javascript function for passing text selections in the embedded content to work, which means the resulting source ends up looking like this:
<div onclick="parent.selectionFunction()" id ="studyContentn">
<!-- HTML of output from cURL, including doctype declarations and <html>,<head> tags -->
</div>
It seems like for the most part a lot of the formatting issues I keep running into are largely arbitrary. I've tried using php Tidy to clean output from HTML, but that also only works for some pages but not many others. I've got a slight suspicion it may have to do with CDATA declarations that get parsed oddly when working with DOMDocument, but I am not certain.
Is there a way I can guarantee that HTML output from cURL will be rendered correctly and faithfully in all instances? Or is there perhaps a better way of going about doing this? I've tried a bunch of different ways of approaching this issue, and each gets closer to a solution but brings its own new problems as well.
Thanks -- let me know if I can clarify anything.
If I understand correctly you are trying to pull the html of a complete web page and display it under your domain, in your html. This is always going to be tricky, a lot of java script will break, relative url's will be wrong and as you mentioned, styles as well. Your probably also changing the dimensions that the page is displayed in. These can all be worked around but your going to be fighting an uphill battle with each new site, or if a current site change design
I'd probably take a different approach to the problem. You might want to write a browser plugin as the interface to the external web site instead. Then your applet can sit on top of the functional and tested (hopefully) site. Then you can focus on what you need to do for your applet rather than a never ending list of fiddly html issues.
I am trying to do a similar thing. It is very difficult to conserve the formatting, and the JS scripts in webpage complicated the thing. I finally gave up the complete the idea of completely displaying the original format, but do it with a workaround:
Select only headers, links, lists, paragraph which you are interested at.
Add the domain path of your ownsite to links.
You may wrap the headers, links etc. items by your own class.
Display it
in your case you want to select text and store it, which is another topic. What I did is to parse the HTMl in two levels, and then it is easy to do the selection. Keep in mind IE and Firefox/Chrome needs to be dealt with separately.
I'm building an application for a company that, unfortunately, has a very poorly designed website. Most of the HTML tags are wrongly and sometimes randomly placed, there is excessive use of no-break-spaces, p tags are randomly assigned, they don't follow any rule and so on...
I'm retrieving data from their website by using a crawler and then feeding the resulted strings to my application through my own web-service. The problem is that once displaying it into the android textview, the text is formatted all wrong, spread and uneven, very dissorderly.
Also, worth mentioning that I can not suggest to the company for various reasons to modify their website...
I've tried
String text = Html.fromHtml(myString).toString();
and other variations, I've even tried formatting it manually but it's been a pain.
My question is:
Is there an easy, elegant way to re-format all this text, either with PHP on my web-service or with Java, directly in my Android application?
Thanks to anyone who will take the time to answer...
You can use Tidy with PHP to clean up the code if you're keeping it in place. Otherwise stripping the HTML would probably make working with it a lot easier.
I would so: no, there is no easy, elegant way. HTML combines data and visual representation, they are inherently linked. To understand the data you must look at the tags. Tags like <h1> and <a> carry meaning.
If the HTML is structured enough to break it down into meaningful blocks: header, body and unrelated/unimportant stuff. Then you could apply restyling principles to those. A simple solution is to just strip all the tags, get only the textNodes and stitch them together. If the HTML is exceptionally poorly formatted you might get sentences that are out of order, but if the HTML isn't too contrived I expect this approach should work.
To give you an indication of the complexity involved: You could have <span>s that have styling applied to them, for instance display: block. This changes the way the span is displayed, from inline to block, so it behaves more like a <div> would. This means that each <span> will likely be on it's own line, it will seem to force a line break. Detecting these situations isn't impossible but it is quite complex. Who knows what happens when you've got list elements, tables or even floating elements; they might be completely out of order.
Probably not the most elegant solution, but I managed to get the best results by stripping some tags according to what I needed with php (that was really easy to do) and then displaying the retrieved strings into formatted WebViews.
As I said, probably not the most elegant solution but in this case it worked best for me.
I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using php's explode() method to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is - how do I know if the HTML structure has changed? How to identify this before storing any data to my database to avoid wrong data being stored.
I think you don't have any clean solutions if you are scraping a page where content changes.
I have developed several python scrapers and I know how can be frustrating when site just makes a subtle change on its layout.
You could try a solution a la mechanize (don't know the php counterpart) and if you are lucky you could isolate the content you need to extract (links?).
Another possibile approach would be to code some constraints and check them before store to db.
For example, if you are scraping Urls, you will need to verify that what scraper has parsed is formally a valid Url; same for integer ID or whatever you want to scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
Depends on the site but you could count the number of page elements in the scraped page like div, class & style tags then by comparing these totals against those of later scrapes detect if the page structure has been changed.
A similiar process could be used for the CSS file where the names of each each class or id could be extracted using simple regex, stored and checked as needed. If this list has new additions then the page structure has almost certainly changed somewhere on the site being scraped.
Speaking out of my ass here, but its possible you might want to look at some Document Object Model PHP methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
(By the way, the way I did this when I was trying to get an email notification when the bar exam results were posted on a particular page was just compare file_get_contents() values. Surprisingly, worked flawlessly: No false positives, and emailed me as soon as the site posted the content.)
If you want to know changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with new one.
There are lot of way you can do it:-
SaxParser
DOmParser etc
I have a small blog which will give some pointers to what I mean
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or DOm Utility parser.
First, in some cases you may want to compare hashes of the original to the new html. MD5 and SHA1 are two popular hashes. This may or may not be valid in all circumstances but is something you should be familiar with. This will tell you if something has changed - content, tags, or anything.
To understand if the structure has changed you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order then you would have to capture a tree of the tags and do a comparison to see if the tags occur in the same order. This is going to be very specific to what you want to achieve.
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
Explode() is not an HTML parser, but you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.
As an exercise in web design and development, I am building my website from the ground up, using PHP, MySQL, JavaScript and no frameworks. So far, I've been following a model-view-controller design. However, there is one hurdle that I am quickly approaching that I'm not sure how I'm going to solve, but I'm sure it's been addressed before with varying degrees of success.
On my website, I'm going to have a resume and an "about me" bio section. These probably won't be changing very often.
For my resume, I think that XML that can be rendered into HTML (or any other format) is the best option, and in that case, I could even build a "resume manager" using PHP that can edit the underlying XML. A resume also seems like it could be built on top of MySQL, as well, and generated into XML or HTML or whatever output format I choose.
However, I'm not sure how to store my about me/bio. My initial idea was a plain text document that can be read it, parsed, and the line breaks converted to paragraphs. However, I'm not sold on that being the best idea. My other idea was using MySQL, but I think that might be overkill for a single page. What I do know, however
What techniques have you used when storing text for a page that will not change very often? How did they work out for you - what problems or successes did you have?
Like McWafflestix said, use HTML, if you want to output HTML. Simplest case within PHP:
<?php
create_header_stuff();
include('static_about.html');
create_footer_stuff();
?>
and in static_about.html something like
<div id="about">
...
</div>
Cheers,
Just use a static page, if the information won't change very often. Just using static HTML gives you more control over the display format.
Generally treating infrequently changing information the same as frequently changing information works well if you add one other component: caching.
Whatever solution you decide on for the back end, store the output in a cache and then check to see if the data has changed. Version numbers or modified dates work well here. If it hasn't changed, just give the cached data. If it has changed then you rebuild the content, cache it and display.
As far as structure goes, I tend to use text blobs in a database if there is any risk that there will be more dynamic databases. XML is a great protocol for communicating between services and as an intermediate step, but I tend to use a database under all my projects because eventually I end up using it for other things anyway.