I'm working on an information warehousing site for HIV prevention. Lots of collaborators will be posting articles via a tinyMCE GUI.
The graphic designers, of course, want control over page lengths. They would like automatic pagination based on the height of content in the page.
Anyone seen AJAX code to manage this?
Barring that, has anyone seen PHP code that can do a character count and a look-behind regex to avoid splitting words or tags?
Any links much appreciated!
If it doesn't need to be exact there's no reason you can't use a simple word count function to determine an appropriate place to break the page (at the nearest paragraph I suppose). You could go so far as to reduce the words per page based on whether there are images in the post, even taking the size of the images into account.
That could get ugly fast, though. I think the best way to do it is to let them manually set the page dividers with a tag in the article that you can parse out. Something like [pagebreak] is pretty straightforward, and you'll get much more logical and readable page breaks than any automated solution would achieve.
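Parsing the marker back out is only a few lines of PHP. A minimal sketch (names like $article and the ?page parameter are placeholders I'm assuming here, not anything from the question):

<?php
// Minimal sketch: split the stored article HTML on a manual [pagebreak]
// marker and show the page requested via ?page=N.
$pages = explode('[pagebreak]', $article);
$page  = isset($_GET['page']) ? (int) $_GET['page'] : 1;
$page  = max(1, min($page, count($pages)));   // clamp to a valid page number
echo $pages[$page - 1];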
You don't just have to worry about character count; you also have to worry about image heights if there are images or any other kind of embedded objects in your pages that take up height. Character count also won't give you any idea of paragraph structure (a single long paragraph with more characters might still render shorter than a page with many short paragraphs).
If you're willing to use JavaScript, that might be the ideal solution: send the entire article to the client and let JavaScript handle the pagination. From the client you can detect image and object heights. You could use PHP to place markers where you think the pages should be, and then use JavaScript to make it happen. Unless the pages are very long, I don't think you'll need several XMLHttpRequests (AJAX).
A straight PHP solution is also simple, but probably not ideal, since this isn't just a matter of managing row counts. You could use a GET variable to determine where you are in the article.
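A rough sketch of that GET-variable approach, breaking at the nearest paragraph once a word budget is hit ($article, the 400-word budget and the ?page parameter are all just assumptions for illustration):

<?php
// Rough sketch only: accumulate paragraphs until a word budget is reached,
// then start a new page; serve the page requested via ?page=N.
$wordsPerPage = 400;
$paragraphs   = preg_split('/<\/p>/i', $article, -1, PREG_SPLIT_NO_EMPTY);

$pages  = [];
$buffer = '';
foreach ($paragraphs as $p) {
    $buffer .= $p . '</p>';
    if (str_word_count(strip_tags($buffer)) >= $wordsPerPage) {
        $pages[] = $buffer;
        $buffer  = '';
    }
}
if ($buffer !== '') {
    $pages[] = $buffer;   // whatever is left becomes the last page
}

$current = isset($_GET['page']) ? (int) $_GET['page'] : 1;
echo $pages[max(0, min($current - 1, count($pages) - 1))];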
This might not be the exact answer you're looking for, but you should really make sure your site doesn't have a fixed height. Flexible widths are nice, but not as critical as a flexible height.
Especially for a cause like this and a content-heavy site, it's fair to require flexible heights.
As apphacker mentioned, you can't really detect the height from within PHP, so you're kind of stuck with JavaScript. If you're absolutely stuck with paging, it's probably better to let your content authors decide when to break the page, so you break it at a real section instead of mid-word, mid-sentence, etc.
Edit: usability should dictate design, not the other way around. You're doing it wrong ;)
Good pagination is not a simple task, and it's not just a matter of coding. Plass (1981) showed that optimal page breaking is, in general, NP-hard.
You have to worry about floating figures, line breaks, different font styles, etc.
And the only thing an HTML engine can help you with is parsing the page into a DOM tree. What about sizes? Yes, you can get font widths and heights, margins and paddings, picture sizes. But that's all: the layout itself is on your shoulders. And doing it in JavaScript... meh...
So the only feasible solution for automatic fixed-height pagination would be server-side. PrinceXML is currently the best HTML-to-PDF converter, but it costs a lot.
If you're OK with varying page heights, you could go with epalla's suggestion, but that isn't as simple as it seems either.
Some references for pagination:
Plass, Optimal Pagination Techniques for Automatic Typesetting Systems (1981)
On the Pagination of Complex Documents (1998)
Pagination Reconsidered
Knuth, Digital Typography
I am familiar with scraping and using XPath in PHP to parse the DOM to get what I want from a page. What I would like to hear are some suggestions on how I could programmatically ignore the header, footer and sidebars on a page and only extract the main body content.
The situation is that there is no specific target, so I cannot simply ignore specific IDs like #header and #footer, because every page is written slightly differently.
I know that Google does this, so I know it must be possible; I just don't really know where to start with it.
Thanks!
There is no definitive way to determine it, but you can get reasonable results with heuristic methods. A suggestion:
Scrape two or more pages from the same website and compare them block by block, starting at the top level and going a few levels deep, until the blocks are sufficiently similar. The comparison would not be == but a similarity index, for example with similar_text.
Blocks above a certain percentage of similarity will most likely be the header, footer or menu. You will have to find out by experimenting which threshold is useful.
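A minimal sketch of that comparison with similar_text, assuming $blocksA and $blocksB hold the HTML of the top-level blocks of two pages from the same site (the 90% threshold is only a guess you would tune):

<?php
// Illustrative sketch: drop blocks from page A that are near-duplicates of
// blocks on page B, since those are probably header, footer or menu.
$threshold = 90; // percent similarity above which a block counts as boilerplate

foreach ($blocksA as $i => $blockA) {
    foreach ($blocksB as $blockB) {
        similar_text($blockA, $blockB, $percent);
        if ($percent >= $threshold) {
            unset($blocksA[$i]);   // most likely header, footer or menu
            break;
        }
    }
}
$mainContent = implode("\n", $blocksA); // what's left is likely the article body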
There is no small or quick way to scrape content from a webpage. I have done a lot of these, and there is no simple rule. In the earlier HTML3/table-based design days there were different ways to identify the layout, and site design itself was limited: screen sizes were small, so the menu usually ran along the top and there was no room for right or left panels. Then came the era of table designs with panels, and now it's the time of floating content. On top of that we even use overflow:hidden, so it's even harder to identify the body by word count and the like.
When an HTML file is written, the code is never tagged as "content" or "menu". You can sometimes derive that from class names, but that is not universal. The content gets its size and position from CSS, so your parser alone can never determine which part of the page is the body. If you use an embedded HTML viewer and DHTML/JS to measure the sizes of blocks after rendering, there might be some way to do it, but it will still never be universal. My suggestion is to build your own parser and improve it case by case.
Google has built programs for most combinations of HTML designs, but even for Google, I think a truly universal parser is impossible.
Essentially I have blocks of text on a page, about 10 same-sized boxes of text laid out with CSS. I want to be able to reorder these by dragging and dropping them. I can code the backend ordering myself; can anyone recommend where to go for the front-end drag and drop? I'm aware that jQuery would probably be my best bet, but I've never used JavaScript, so if there's any sort of code already written for this it would be incredibly helpful.
Thanks!
You mean this? http://jqueryui.com/draggable/
Here's jQueryUI's draggable and droppable:
http://jqueryui.com/draggable/
http://jqueryui.com/droppable/
They're kind of cute but might well not do what you want.
Here's a pretty straightforward explanation of how to implement drag and drop the HTML5 way:
http://www.w3schools.com/html/html5_draganddrop.asp
(It mostly convinced me to implement my own rather than wait for better HTML5 drag and drop support.)
Here's an example for older browsers:
http://luke.breuer.com/tutorial/javascript-drag-and-drop-tutorial.aspx
There's also draggable.js (and its jQuery port), neither of which I recommend.
Stuff these examples don't really do includes pretty much anything useful. E.g. what if you want to target PARTS of an object, or the spaces BETWEEN objects? How can you make targets dynamic based on what's going on in the page? It all gets very ugly very fast.
Sorry not to have a simpler answer.
I'm building an application for a company that, unfortunately, has a very poorly designed website. Most of the HTML tags are wrongly and sometimes randomly placed, there is excessive use of non-breaking spaces, p tags are randomly assigned, they don't follow any rules, and so on...
I'm retrieving data from their website with a crawler and then feeding the resulting strings to my application through my own web service. The problem is that once the text is displayed in the Android TextView, it is formatted all wrong: spread out, uneven, very disorderly.
Also worth mentioning: for various reasons I cannot suggest that the company modify their website...
I've tried
String text = Html.fromHtml(myString).toString();
and other variations. I've even tried formatting it manually, but it's been a pain.
My question is:
Is there an easy, elegant way to re-format all this text, either with PHP on my web-service or with Java, directly in my Android application?
Thanks to anyone who will take the time to answer...
You can use Tidy with PHP to clean up the markup if you're keeping it in place. Otherwise, stripping the HTML would probably make working with it a lot easier.
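A quick sketch of the Tidy route, assuming the tidy extension is installed and $dirtyHtml holds the scraped markup (the options shown are just one reasonable configuration, not the only one):

<?php
// Clean up the scraped markup with PHP's tidy extension before handing it
// to the app; all variable names here are illustrative.
$config = [
    'clean'          => true,
    'output-xhtml'   => true,
    'show-body-only' => true,
    'wrap'           => 0,
];
$tidy = new tidy();
$tidy->parseString($dirtyHtml, $config, 'utf8');
$tidy->cleanRepair();
$cleanHtml = tidy_get_output($tidy);

// Or, if the app only needs plain text anyway:
$plainText = trim(strip_tags($cleanHtml));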
I would say: no, there is no easy, elegant way. HTML combines data and visual representation; they are inherently linked. To understand the data you must look at the tags, and tags like <h1> and <a> carry meaning.
If the HTML is structured enough, you could break it down into meaningful blocks (header, body, and unrelated/unimportant stuff) and then apply restyling rules to those. A simpler solution is to just strip all the tags, keep only the text nodes and stitch them together. If the HTML is exceptionally poorly formatted you might get sentences that are out of order, but if it isn't too contrived I expect this approach should work.
To give you an indication of the complexity involved: you could have <span>s with styling applied to them, for instance display: block. This changes the way the span is displayed, from inline to block, so it behaves more like a <div> would. That means each <span> will likely end up on its own line; it will seem to force a line break. Detecting these situations isn't impossible, but it is quite complex. And who knows what happens when you've got list elements, tables or even floating elements; they might be completely out of order.
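For what it's worth, a rough sketch of that strip-everything-and-stitch-the-text-nodes idea in PHP, assuming $uglyHtml holds the scraped markup:

<?php
// Walk the DOM and collect text nodes, ignoring script/style content.
// DOMDocument is fairly tolerant of broken markup, which helps here.
libxml_use_internal_errors(true);          // don't choke on the bad HTML
$doc = new DOMDocument();
$doc->loadHTML($uglyHtml);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//body//text()[not(ancestor::script) and not(ancestor::style)]');

$parts = [];
foreach ($nodes as $node) {
    $text = trim($node->nodeValue);
    if ($text !== '') {
        $parts[] = $text;
    }
}
echo implode("\n\n", $parts);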
Probably not the most elegant solution, but I managed to get the best results by stripping some tags according to what I needed with PHP (that was really easy to do) and then displaying the retrieved strings in formatted WebViews.
As I said, probably not the most elegant solution but in this case it worked best for me.
I'm building a PHP email mailbox script.
How would I make HTML emails display cleanly, as they do in Gmail/Hotmail?
If I just echo the email out, it affects the whole page layout.
I could use iframes, but surely that isn't the best solution.
If you are looking for the 'best solution', get on board with another open source email library that is doing the same thing you are. Maintaining an email renderer on your own that is safe against script injection and other hacks will simply be too much work for one person.
One example: https://github.com/afterlogic/webmail-lite
Another: http://trac.roundcube.net/
You get the benefit of other developers who use the library maintaining the code base, so if something is broken, all you have to do is (hopefully) pull the latest update and you get the fix. If you find something that needs improving, you can fix it or build it and make the code better for everyone. I'm really just pitching open source libraries here; however, in any commercial context, building your own email renderer without a big team is a bad idea.
As Marc B stated, I believe an iframe would be your best bet... but please realize that if you just dump in any email HTML you risk exposing yourself to viruses, Trojans, and malicious HTML/JavaScript. You're opening Pandora's box on your computer unless you find a good way to sandbox/strip that HTML.
Here's a simple regex to clean out JavaScript, at least:
"(?s)<script.*?(/>|</script>)"
Consider using an HTML Tidy library (e.g. PHP's Tidy extension).
You can pass the text through the library to get well-formatted HTML.
A good practice would be to define standard CSS behaviour for most tags inside the div you're using.
Create a DIV container that you assign width (and height if needed) to, and make sure you add an overflow property to match your design. This should keep your email HTML from interfering with your layout.
UPDATE
A DIV container still lets you constrain the size of the display box, and with appropriate CSS it acts similarly to an iframe without all the baggage.
If you are worried about the code in the email, strip_tags would seem a better solution than the regex. You can define a list of tags to leave alone and still be confident of stripping the rest.
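For example (the allowed-tag list here is only an illustration; note that strip_tags leaves attributes like onclick or style on the tags you allow, so you may still want an attribute-stripping pass afterwards):

<?php
// Keep a small set of harmless tags and drop everything else.
$allowed   = '<p><br><b><i><u><a><ul><ol><li><blockquote>';
$safeEmail = strip_tags($rawEmailHtml, $allowed);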
I'm trying to write a text parser with PHP, like Instapaper did. What I want to do is get a webpage and parse it in text-only mode.
It's simple to get the webpage with cURL and strip the HTML tags. But every webpage has some common areas, like the header, navigation, sidebar, footer, banners, etc. I only want to get the article in text mode and exclude all the other parts. It's also simple to exclude those parts if I know the "id" or "class" info, but I'm trying to automate this process and apply it to any page, like Instapaper does.
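For reference, the "simple" part I mean is roughly this (the URL is just a placeholder):

<?php
// Fetch a page with cURL and reduce it to plain text. This is the easy
// part; the hard part is knowing which of this text is the article.
$ch = curl_init('http://example.com/some-article');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

$text = trim(strip_tags($html));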
I can get all the content between the <body> tags, but I don't know how to exclude the header, sidebar or footer and get only the main article body. I have to develop some logic to extract only the main article part.
It's not important for me to find the exact code. It would be enough to understand how to exclude the unnecessary parts, as I can try to write my own code in PHP. Examples in other languages would also be useful.
Thanks for helping.
You might try looking at the algorithms behind this bookmarklet, Readability. It's got a decent success rate for extracting content from amongst all the web-page rubbish.
A friend of mine made it, which is why I'm recommending it: I know it works, and I'm aware of the many techniques he's using to parse the data. You could apply these techniques to what you're asking.
You can take a look at the source of Goose; it already does a lot of this, like Instapaper-style text extraction.
https://github.com/jiminoc/goose/wiki
Have a look at the ExtractContent code from Shuyo Nakatani.
See the original Ruby source at http://rubyforge.org/projects/extractcontent/ or a port of it to Perl at http://metacpan.org/pod/HTML::ExtractContent
You really should consider using an HTML parser for this. Gather similar pages and compare the DOM trees to find the differing nodes.
This article provides a comparison of different approaches. The Java library boilerpipe was rated highly, and on the boilerpipe site you can find the author's scientific paper, which compares it to other algorithms.
Not all algorithms suit all purposes. The biggest application of such tools is just getting raw text to index for a search engine, the idea being that you don't want search results messed up by adverts. That kind of extraction can be destructive, meaning it won't give you "the best reading area", which is what people want with Instapaper or Readability.