Clean HTML from Word documents

Clean HTML from Word documents - php

Ok, so my company has a client that has an interface for posting content - standard MySQL database, PHP-based, etc.
Anyway, they've continually had an intern or someone, post content to this interface straight from an MS Word doc - the interface is coded poorly, and takes this input as is, with no formatting.
My company has now been contracted out to fix this particular problem, as it is continually breaking their site, and my company has repeatedly had to manually go into the database, and delete the offending values.
Is there a quick and easy way to do this, or am I going to have to just do a replace operation on each offending character?
I see htmlentities() may be a partial solution - but as far as I know, that won't remove everything.
What's a good solution to this problem? Is there anything out there to make this easier?
We're also considering writing a content validator as well, probably just server-side (though maybe client-side, if my week is going slowly enough/I finish the rest of this quickly enough).

It depends on how many clients (or potential clients) you are supporting and how much time you have to invest. Options
Write your own function to strip out the metadata
Teach your clients to remove it themselves such as paste in notepad first,
or supply a knowledge base article to explain how to do it in the software. Perhaps a "Help" section or icon they can click on.
htttp://support.microsoft.com/default.aspx?scid=kb;en-us;223396
Use a WYSIWYG editor such as TinyMCE which has built in functionality to remove it
But like I said in the comments, unless you are using your own function, prepare for clients to continue to paste directly and wonder why there is a problem.

Related

Is "HTML Purifier" really trustworthy? Is there a better way to secure an untrusted/unsafe HTML code string in PHP?

I'm trying to secure HTML coming from external sources, for display on my own web control panel (to load in my browser, read, and delete).
strip_tags is completely unsafe and useless.
I went through a ton of trouble to make my own DOMDocument-based HTML securer, removing unsafe elements and attributes. Then I got linked to this nightmare of a webpage: https://owasp.org/www-community/xss-filter-evasion-cheatsheet
That document convinced me that not only is my "clever" HTML securer not enough -- there are far more things that can be done to inject malicious code into HTML than I ever could imagine. That list of things gives me the creeps for real. What a cold shower.
Anyway, looking for a (non-Google-infested) HTML securer for PHP, I found this: http://htmlpurifier.org/
While it seems OK at first glance, some signs point toward sloppiness which is the last thing you want in a security context:
On http://htmlpurifier.org/download , it claims that this is the official repository: https://repo.or.cz/w/htmlpurifier.git
But that page was last updated in "2018-02-23", with the label "Whoops, forgot to edit WHATSNEW".
The same page as in #1 calls the Github link the "Regular old mirror", but that repository has current (2020) updates... So is that actually the one used? Huh? https://github.com/ezyang/htmlpurifier/tree/master
At https://github.com/ezyang/htmlpurifier/blob/v4.13.0/NEWS , it says: "Further improvements to PHP 6.4 support". There never existed a PHP 6.4...
My perception of that project is that it's run by very sloppy and careless people. Can people who make so many mistakes and take so little care to keep their website correct really be trusted to write secure code to purify HTML?
I wish I had never been linked to that page with exploits. I was proud of my own code, and I spent a lot of time on it even though it's not many lines.
This really makes me wonder what everyone else is using (not made by Google). strip_tags is obviously a complete "no-no", but so is my DOMDocument code. For example, it checks if any href begins with (case insensitively) "javascript:", but the nightmare page shows that you can inject "invisible" tabs such as "ja vascript:" and add encoded characters and everything to break my code and allow the "javascript:" href after all. And numerous other things which would simply be impossible for me to sit and address in my own code.
Is there really no real_strip_tags or something built into PHP for this crucial and common task?

HTML Purifier is a pretty good, established and tested library, although I understand why the lack of clarity as to which repository is the right one really isn't very inspiring. :) It's not as actively worked on as it was in the past, but that's not a bad thing in this case, because it has a whitelist approach. New and exciting HTML that might break your page just isn't known to the whitelist and is stripped out; if you want HTML Purifier to know about these tags and attributes, you need to teach it about how they work before they become a threat.
That said, your DOMDocument-based code needn't be the wrong approach, but if you do it properly you'll probably end up with HTML Purifier again, which essentially parses the HTML, applies a standards-aware whitelist to the tags, attributes and their values, and then reassembles the HTML.
(Sidenote, since this is more of a question of best practises, you might get better answers on the Software Engineering Stack Exchange site rather than Stackoverflow.)

TinyMCE, PHP and MySQL: security and escaping questions

I'm implementing TinyMCE for a client so they can edit front-end content via a simple, familiar interface in their site's admin panel.
I have never used TinyMCE before but notice that you are able to insert whatever markup you want and it will be happily saved off to the MySQL database, assuming you don't escape the contents of the TinyMCE before running it through your query.
You can even insert single quotes and have it break your SQL query entirely.
But of course, when I do escape the contents, benign presentational stuff like paragraph tags get converted to HTML entities and so the whole point of the WYSIWYG editor is defeated, because the entities are spat back out when it comes to displaying the stored content on the front-end.
So is there a way I can "selectively escape" content from TinyMCE, to keep the innocent tags like P and BR but get rid of dangerous ones like SCRIPT, IFRAME, etc.? I really don't want to have to manually encode and decode them using str_replace() or whatever, but I'd rather not give my client a gaping security hole either.
Thanks.

Have you tried htmlpurifier? works wonders. Its caveats; big and slow, but the best you can have.
http://htmlpurifier.org .

Sorry Dude, I'd say this a question for the authors of TinyMCE, so I suggest you ask at: http://tinymce.moxiecode.com/enterprise/support.php ... I'm sure they'll be only to happy to answer (for a small fee), and I suspect this may even be one of there FAQ's.
It's just that I'd guess you'd be very lucky if you hit another TinyMCE-user (let alone an authorative one) on stack-overflow, a "general programming forum"... although I notice there are currently 837 questions tagged "tinymce" on this forum; have you tried searching through them? Maybe there's a pointer in one of those?
Cheers. Keith.
EDIT: Yep, Making user-made HTML templates safe is more or less the same question posed in different words, and it has (what looks to ignorant me) a couple of answers which posit practical solutions. I just searched stack overflow for "Tiny MCE html security".

That's like complaining that you can write naughty words in Microsoft Word, and that Word should filter them for you. Or complain to GM that they build cars that then get used as escape vehicles in bank robberies. TinyMCE's job is to be an online editor, not to be the content police.
If you need to ban certain tags, then remove them when the document's submitted by using strip_tags(). Or better yet, HTMLpurifier for a more bullet-proof sanitization. If embedded quotes are breaking your SQL, then why weren't you passing the submitted document through mysql_real_escape_string() or using PDO prepared queries first? MCE has no idea what the server-side handling is going to be, nor should it care at all. It's up to you to decide how to handle the data, because only you know what its ultimate purpose is going to be.
In any case, remember that all those editors work on the client side. You can make TinyMCE as bulletproof and as strict an editor as you want, but it's still running on the client. Nothing says a malicious user can't bypass it entirely and submit all the embedded quotes and bad tags they want. The ultimate responsibility for cleaning the data HAS to fall on your code running on the server, as it's the last line of defense, and the only one that can ensure the database remains pristine. Anything else is lipstick on a pig.

A client wants me to do CSS coding (only) but doesn't want to provide me the php files

I have a client who wants me to do CSS coding only, but doesn't want to give me the php files.
Right now, I just have access to the live website (with no CSS).
It is entirely made with tables and I want to use divs instead
I'm not sure if it is possible to do the coding
I thought about copying and pasting the generated HTML code from each page
Will this cause possible problems with the end result?

Yes, this will cause huge problems: you'll do an awesome job, client will have trouble integrating it with their site, client will abandon your awesome work.
IMO, you should let the client know that you'll do the best you can with what they have given you, but you would be able to save them a lot of work and do a better job if you could have access to the source code.
If you know that you can't make the client happy with what they have given you, though, it would be doing everyone a disservice for you to try.

If you absolutely can't convince them to give you access to the source, then this client sounds stupid:
He has a layout which is table based.
He wants you to magically make it look better with CSS, without having access to the source.
"#Phoenix I don't see any classes or IDs." - there are no classes or ids to hook into.
You might be able to do it if you used some CSS3 selectors to, for example, select the 3rd td inside a td inside the 2nd table to apply styles to ;)
But, that won't help if you have to support older browsers, which makes this impossible at the moment without doing something differently.
I don't have full knowledge of your situation, but here's what I would probably do (if I couldn't convince them to give me access to the source):
Open the live site.
Copy the HTML source code.
Paste it into a new local file.
Add this into the <head> section: <base href="http://the-clients-site.com/" />.
This will let all the assets on the page load from the client's actual site.
Now, you have something to work with.
You have to keep track of ALL changes you make to the file.
The first change should be adding your own blank style tag.
Then, you can add id and class to whichever elements you feel need it.
You should try to avoid moving around elements, unless it's absolutely required. Those changes are a whole lot harder to explain to someone. I know from experience.
You should be able to style the page properly now.
Then, you deliver the completed page, and the documented list of changes you had to make to the HTML (add id, here add class there).
The client should then be able to integrate the changes into his site.

Well, at a bare minimum they'll need to modify ther PHP to reference your CSS. More importantly, you need to be able to hook your CS up to elements - Do tables/rows/etc. have Ids or classes attached?
If they are clever and have some good separation between code and presentation (using a templating engine or similar) then you can probably just edit the template / css.
If they won't let you edit the PHP and you come up with a new awesome layout, they will have a nightmare job trying to integrate it and probably won't bother.

I don't see the problem. You can style tables just as easily as divs. You don't have to know how the wall is built to know how to paint it, which is pretty much all you've been hired to do. Only problem I could see would be if they haven't added any classes or ids to the elements yet. After all, what the browser/client sees is the only thing that needs styling, and since you can see everything that the browser sees, you can see everything that needs styling.
If they have added classes/ids, then just take a copy of a page and style it in a testing area, and then once it looks nice, you take a copy of another page and make sure it looks nice with it too, add to the CSS if there are any new unstyled elements that didn't exist on the first page, once it looks nice, then move on to another page, and another repeating the process until you are satisfied that it appears that every page within reason would look nice with it.
If they haven't added classes/ids, tell them they need to in some capacity before you can work on it, perhaps provide some guidance on the issue.

I'm actually doing this right now for SO.
I'm working on a userscript that provides an alternate "clean" stylesheet for the StackExchange network. I have no access to the SO engine. I am using the Chrome Inspector to look at how the elements are set up. I recommend the same. (Although it is a little different, since I'm modifying the original CSS file.)
You can easily identify what you want to style with the Inspector and then work from there. I would suggest that you ask your client for a list of classes and IDs though. (I got that in the form of an existing stylesheet, you can go about it in a different way, if that suits you and your client.)

Getting to know a new web-system that you have to work on/extend

I am going to start working on a website that has already been built by someone else.
The main script was bought and then adjusted by the lead programmer. The lead has left and I am the only programmer.
Never met the lead and there are no papers, documentation or comments in the code to help me out, also there are many functions with single letter names. There are also parts of the code that are all compressed in one line (like where there should be 200 lines there is one).
There are a few hundred files.
My questions are:
Does anyone have any advice on how to understand this system?
Has anyone had any similar experiences?
Does anyone have a quick way of decompressing the lines?
Please help me out here. This is my first big break and I really want this to work out well.
Thanks
EDIT:
On regards to the question:
- Does anyone have a quick way of decompressing the lines?
I just used notepad++ (extended replace) and netbeans (the format option) to change a file from 1696 lines to 5584!!
This is going to be a loooonnngggg project

For reformatting the source, try this online pretty-printer: http://www.prettyprinter.de/
For understanding the HTML and CSS, use Firebug.
For understanding the PHP code, step through it in a debugger. (I can't personally recommend a PHP debugger, but I've heard good things about Komodo.)
Start by checking the whole thing into source control, if you haven't already, and then as you work out what the various functions and variables do, rename them to something sensible and check in your changes.
If you can cobble together some rough regression tests (eg. with Selenium) before you start then you can be reasonably sure you aren't breaking anything as you go.

Ouch! I feel your pain!
A few things to get started:
If you're not using source control, don't do anything else until you get that set up. As you hack away at the files, you need to be able to revert to previous, presumably-working versions. Which source-control system you use isn't as important as using one. Subversion is easy and widely used.
Get an editor with a good PHP syntax highlighter and code folder. Which one is largely down to platform and personal taste; I like JEdit and Notepad++. These will help you navigate the code within a page. JEdit's folder is the best around. Notepad++ has a cool feature that when you highlight a word it highlights the other occurrences in the same file, so you can easily see e.g. where a tag begins, or where a variable is used.
Unwind those long lines by search-and-replace ';' with ';\n' -- at least you'll get every statement on a line of its own. The pretty-printer mentioned above will do the same plus indent. But I find that going in and indenting the code manually is a nice way to start to get familiar with it.
Analyze the website's major use cases and trace each one. If you're a front-end guy, this might be easier if you start from the front-end and work your way back to the DB; if you're a back-end guy, start with the DB and see what talks to it, and then how that's used to render pages -- either way works. Use FireBug in Firefox to inspect e.g. forms to see what names the fields take and what page they post to. Look at the PHP page to see what happens next. Use some echo() statements to print out the values of variables at various places. Finally, crack open the DB and get familiar with its schema.
Lather, rinse, repeat.
Good luck!

Could you get a copy of the original script version which was bought? It might be that that is documented. You could then use a comparison tool like Beyond Compare in order to extract any modifications that have been made.

If the functions names are only one letter it could be that the code is encoded with some kind of tool (I think Zend had a tool like that - Zend Encoder?) so that people cannot copy it. You should try to find an unencoded version, if there is one because that would save a lot of time.

How should I store textual data that won't change very often?

As an exercise in web design and development, I am building my website from the ground up, using PHP, MySQL, JavaScript and no frameworks. So far, I've been following a model-view-controller design. However, there is one hurdle that I am quickly approaching that I'm not sure how I'm going to solve, but I'm sure it's been addressed before with varying degrees of success.
On my website, I'm going to have a resume and an "about me" bio section. These probably won't be changing very often.
For my resume, I think that XML that can be rendered into HTML (or any other format) is the best option, and in that case, I could even build a "resume manager" using PHP that can edit the underlying XML. A resume also seems like it could be built on top of MySQL, as well, and generated into XML or HTML or whatever output format I choose.
However, I'm not sure how to store my about me/bio. My initial idea was a plain text document that can be read it, parsed, and the line breaks converted to paragraphs. However, I'm not sold on that being the best idea. My other idea was using MySQL, but I think that might be overkill for a single page. What I do know, however
What techniques have you used when storing text for a page that will not change very often? How did they work out for you - what problems or successes did you have?

Like McWafflestix said, use HTML, if you want to output HTML. Simplest case within PHP:
<?php
create_header_stuff();
include('static_about.html');
create_footer_stuff();
?>
and in static_about.html something like
<div id="about">
...
</div>
Cheers,

Just use a static page, if the information won't change very often. Just using static HTML gives you more control over the display format.

Generally treating infrequently changing information the same as frequently changing information works well if you add one other component: caching.
Whatever solution you decide on for the back end, store the output in a cache and then check to see if the data has changed. Version numbers or modified dates work well here. If it hasn't changed, just give the cached data. If it has changed then you rebuild the content, cache it and display.
As far as structure goes, I tend to use text blobs in a database if there is any risk that there will be more dynamic databases. XML is a great protocol for communicating between services and as an intermediate step, but I tend to use a database under all my projects because eventually I end up using it for other things anyway.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.