Formatting HTML correctly from cURL requests

I am working on an applet that allows the user to input a URL to a news article or other webpage (in Japanese) and view the contents of that page within an iFrame in my page. The idea is that once the content is loaded into the page, the user can highlight words using their cursor, which stores the selected text in an array (for translating/adding to a personal dictionary of terms) and surrounds the text in a red box (div) according to a stylesheet defined on my domain. To do this, I use cURL to retrieve the HTML of the external page and dump it into the source of the iFrame.
However, I keep running into major formatting problems with the retrieved HTML. The big problem is preserving style sheets, and to fix this, I've used DOMDocument to add tags to the section of the retrieved HTML. This works for some pages/URLs, but there are still lots of style problems with the output HTML for many others. For example, div layers crash into each other, alignments are off, and backgrounds are missing. This is made a bit more problematic because I need to embed the output HTML in a new <div> in order to make the onClick JavaScript function for passing text selections in the embedded content work, which means the resulting source ends up looking like this:
<div onclick="parent.selectionFunction()" id ="studyContentn">

</div>
For the most part, the formatting issues I keep running into seem largely arbitrary. I've tried using PHP Tidy to clean up the HTML output, but that also works for only some pages and not for many others. I've got a slight suspicion it may have to do with CDATA declarations that get parsed oddly when working with DOMDocument, but I am not certain.
Is there a way I can guarantee that HTML output from cURL will be rendered correctly and faithfully in all instances? Or is there perhaps a better way of going about doing this? I've tried a bunch of different ways of approaching this issue, and each gets closer to a solution but brings its own new problems as well.
Thanks -- let me know if I can clarify anything.

If I understand correctly, you are trying to pull the HTML of a complete web page and display it under your domain, inside your own HTML. This is always going to be tricky: a lot of JavaScript will break, relative URLs will be wrong and, as you mentioned, so will the styles. You're probably also changing the dimensions that the page is displayed in. These can all be worked around, but you're going to be fighting an uphill battle with each new site, or whenever a current site changes its design.
I'd probably take a different approach to the problem. You might want to write a browser plugin as the interface to the external web site instead. Then your applet can sit on top of the functional and tested (hopefully) site. Then you can focus on what you need to do for your applet rather than a never ending list of fiddly html issues.

I am trying to do a similar thing. It is very difficult to preserve the formatting, and the JavaScript in the web page complicates things further. I finally gave up on the idea of reproducing the original format completely and use a workaround instead:
Select only the headers, links, lists, and paragraphs you are interested in.
Add the domain path of your own site to the links.
Optionally wrap the headers, links, etc. in your own class.
Display the result.
In your case you also want to select text and store it, which is another topic. What I did was parse the HTML in two levels, after which the selection is easy to do. Keep in mind that IE and Firefox/Chrome need to be dealt with separately.
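The workaround above might be sketched in PHP roughly as follows. This is a minimal sketch, not the answerer's actual code: the function name extractReadableContent, the study-item class, and the /view.php?url= prefix are hypothetical, and it assumes the links in the fetched page are already absolute. Links that sit outside a heading, paragraph, or list are simply dropped here.

```php
<?php
// Hypothetical helper: keeps headings, paragraphs and lists, drops everything else.
function extractReadableContent(string $html, string $proxyPrefix): string
{
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    // The XML declaration is a common hint that makes loadHTML() read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Headings, paragraphs and lists; anything nested in a list comes along with that list.
    $nodes = $xpath->query(
        '//*[self::h1 or self::h2 or self::h3 or self::p or self::ul or self::ol]'
        . '[not(ancestor::ul or ancestor::ol)]'
    );

    $out = '';
    foreach ($nodes as $node) {
        if (!$node instanceof DOMElement) {
            continue;
        }
        // Route every link through our own site (assumed viewer script).
        foreach ($xpath->query('.//a[@href]', $node) as $link) {
            $link->setAttribute('href', $proxyPrefix . rawurlencode($link->getAttribute('href')));
        }
        // Tag the element with our own class so our stylesheet can target it.
        $node->setAttribute('class', 'study-item');
        $out .= $doc->saveHTML($node) . "\n";
    }
    return $out;
}
```

Calling extractReadableContent($fetchedHtml, '/view.php?url=') would return only the selected elements, ready to be echoed into your own page layout.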

Related

How do I handle the situation when I allow safe external HTML elements but they are not valid, thus still breaking my webpage?

I fetch HTML blobs from external sources, for example RSS feeds, and display in my little localhost web control panel.
For security, I use strip_tags on this content and only allow a small number of "safe" HTML elements, such as <p> and <div>.
Up until today, it has worked as expected.
But today, I got an HTML blob with an error: it contained a <div> but no closing </div>, which ended up messing with my own webpage.
Simply disallowing <div> is not a solution as it would make the content a mess, and of course they could do the same thing with <p> as well or any other allowed element.
I do have a means to validate HTML strings locally, but it takes like two seconds each time, so it's going to add a lot of extra slow-downs. However, if no other solution can be found, I will do this. (That is, if it's not valid, I just strip all HTML elements and let it become a mess/blob rather than breaking my webpage.)
I should of course have predicted that sooner or later, some external source from where I fetch data would have broken HTML. It's a wonder it didn't happen earlier. Still, there is no obvious solution which preserves some kind of structure/markup and is guaranteed to not mess with my page.
I do use an iframe for displaying HTML e-mails, but there are problems with that approach and it basically makes the page annoying to navigate and deal with, so I try to avoid that "solution" for this.

Trying to figure out how to "hide" images and source code

I'm in the process of (slowly) learning how to make my websites more secure. I was checking out D&D Beyond, and noticed a few things I've never seen before, and I would like to learn more about.
Portions of the source code don't show up when you View the Source.
It's hard to explain. I tried to explain it in a different post, and I got a ton of snarky remarks. I'm telling you, I know what I saw. I would like to know how this is possible and how I can replicate it.
I typically write in PHP/JQuery, so I'd primarily like to learn more using those languages.
Example:
You can create a Character using their Character Builder, then view your Character Sheet. The main portion of your character's stats are enclosed in a very large parent div: ".character_sheet"
If you MANUALLY save your Character Sheet to your Desktop, you can see the HTML for this section. If you inspect this section in Firefox, you can also see the data. However, if you try to CTRL+U while in the browser, the HTML in this section does not appear. It also will not appear if you try to curl/fopen/file_get_contents
Additionally, images are not visible by normal means.
For Example: I am aware of how to disable right-clicking on a website, but if someone wanted to take my images, all they'd have to do is open my source code and look at the image url and save it from there.
On the D&D Beyond site, I can bring up Firefox's web inspector where an Image SHOULD be, take a look at the CSS, and... nothing. No link to an image, where one should be. I don't know how they're getting images to appear without css/html. I'd be very interested to know how this is done.
If anyone has any insight/guesses/etc and can point me in the right direction to learn some more, I'd really appreciate it!

Server-side code such as PHP is always hidden to visitors (unless you have a security vulnerability of some sort).
Client-side code such as HTML, JavaScript and CSS is always visible to the visitor. Even if you can't see it immediately in the DOM, it will be hiding there somewhere.
The most likely scenario is that it is hidden within an embedded .js or .css file, which would look similar to the following:
<script src="scripts.js"></script>
<link rel="stylesheet" type="text/css" href="theme.css">
HTML can be output to the page through JavaScript, in which case it will not show in View Source even though it appears in the live DOM in the inspector (HTML produced by a PHP echo, by contrast, does show up in the source). HTML can also be 'hidden' through use of <iframe> tags and HTML imports.
JavaScript has a wide array of ways in which it can be obscured / malformed, so it can be hard to track down. You may see some strange, 'unreadable' code in the DOM / .js files, which in turn could be outputting the HTML itself.

Please consider the points below:
All client-side resources are viewable, although you can make the JavaScript less readable, and it's better to do most of your work on the server side.
If your app is a public website that will be indexed by search engines, you need to know what search engines like, as some of them don't crawl pages that consist only of JavaScript code.
You can display images without <img> tags by using the CSS background-image property.
There are some useful tools for making your code harder to read, such as Closure Compiler Service, JSFuck, and JS packers, although it's better to do it yourself and simply add those techniques to your knowledge. Note that this will make your code larger.
Finally, there is no such thing as a completely blank page source: it should contain at least a <script> tag. If you did see a genuinely blank page, the server may be refusing to serve the content as a top-level window; it may only work when embedded in an iframe, when specific headers are sent with the request, or something similar.
You can make your server and client sides cooperate :) to get a better and more secure result.

Render HTML blocks in page (PHP)

I have a database of emails I've collected from our gmail account which I'm trying to render out to an internal page.
This is working, however the occasional email comes in that causes problems because of missing/not closed tags. There might be some CSS thrown in there that I don't want rendered on the whole page.
I could use iFrames, but they seem outdated, and just not the right approach.
What would the suggested method be to render blocks of HTML from the database without them affecting the rest of the page?

Firstly, you need to load and interpret that HTML into something not broken. To do this, you use a DOM parser. http://php.net/manual/en/domdocument.loadhtml.php
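As a rough illustration of that first step, the sketch below round-trips a fragment through DOMDocument so that unclosed tags get closed. The function name repairHtmlFragment is hypothetical, and the exact whitespace of the output depends on the libxml version, so treat it as a starting point rather than a guaranteed result.

```php
<?php
// Hypothetical helper: lets the parser close whatever the e-mail left open.
function repairHtmlFragment(string $fragment): string
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings silently
    $doc = new DOMDocument();
    // Wrap the fragment in a full document; the XML declaration hints at UTF-8 input.
    $doc->loadHTML('<?xml encoding="UTF-8"><html><body>' . $fragment . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize only the children of <body>, i.e. the repaired fragment itself.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Passing in something like '<div><p>Hello <b>world' should give back markup in which the <b>, <p>, and <div> are all closed, so the fragment can no longer swallow the rest of the page.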
I could use iFrames, but they seem outdated, and just not the right approach.
No, this is wrong. Until we have good shadow DOM support (and likely even after), iframes are the right way to isolate something in its own context. Make sure you use the sandbox attribute.
Note that you could do this without iframes, but it's going to be a lot more work.
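One way the sandboxed iframe could be emitted from PHP is sketched below. The srcdoc attribute and the sandboxedFrame function name are assumptions on my part (the answer only mentions sandbox); serving the e-mail from a separate URL via src would work as well.

```php
<?php
// Hypothetical helper: renders stored e-mail HTML inside an isolated iframe.
function sandboxedFrame(string $emailHtml): string
{
    // Escape quotes and angle brackets so the markup survives inside the attribute value.
    $escaped = htmlspecialchars($emailHtml, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // An empty sandbox attribute applies every restriction (no scripts, no forms, ...).
    return '<iframe sandbox srcdoc="' . $escaped . '" style="width:100%;border:0"></iframe>';
}
```

Because the e-mail's CSS lives inside the frame's own document, it cannot leak into the surrounding page, and unclosed tags stay contained in the same way.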
But curious how something like 'Google' can render it without affecting the whole page.
Google doesn't just accept anything that comes through that e-mail, and neither should you, even if you use the iframe method. You need to make a whitelist of what elements you will support, and filter out everything outside of that. Next, you need to figure out what CSS properties you will support. Finally, you need to transform that whole DOM document into something useful and output it as HTML. Check out HTML Purifier for your whitelisting.
None of this is an easy task. You're stepping into an awful lot of hassle. There is no real standard for HTML e-mail. Each provider and mail client has a different set of what they support, and with varying results.

For the CSS part, you could add an id to the div that holds the email content and then prefix every CSS selector with that id as its parent, so that a rule such as p { ... } becomes #email-content p { ... }.
You could also check that the tags are closed correctly. Both of these ideas take some processing power, but not much, and you can avoid recalculating by preprocessing each email once and storing the result.
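The selector-prefixing idea could look something like the sketch below. It is deliberately naive: scopeCss is a hypothetical name, the regular expression only understands flat "selector { declarations }" rules, and comments, at-rules, and selectors such as body or html are not handled specially.

```php
<?php
// Hypothetical helper: prefixes every selector with a parent selector such as "#email-42".
function scopeCss(string $css, string $scope): string
{
    $scoped = preg_replace_callback(
        '/([^{}]+)\{([^{}]*)\}/', // one flat rule: selector list, then declarations
        function (array $m) use ($scope): string {
            // A rule may list several selectors separated by commas; prefix each one.
            $selectors = array_map(
                fn(string $s): string => $scope . ' ' . trim($s),
                explode(',', $m[1])
            );
            return implode(', ', $selectors) . ' {' . $m[2] . "}\n";
        },
        $css
    );

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $scoped ?? $css;
}
```

The scoped result can be stored alongside the email so the rewrite only runs once per message.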

iFrames I wouldn't say are dated and I would consider them a valid output in your case.
But curious how something like 'Google' can render it without affecting the whole page.
If you inspect, they use iFrames.
100% agree with @brad about running it through a parser and then outputting it to an iFrame.

A client wants me to do CSS coding (only) but doesn't want to provide me the php files

I have a client who wants me to do CSS coding only, but doesn't want to give me the php files.
Right now, I just have access to the live website (with no CSS).
It is entirely made with tables and I want to use divs instead
I'm not sure if it is possible to do the coding
I thought about copying and pasting the generated HTML code from each page
Will this cause possible problems with the end result?

Yes, this will cause huge problems: you'll do an awesome job, client will have trouble integrating it with their site, client will abandon your awesome work.
IMO, you should let the client know that you'll do the best you can with what they have given you, but you would be able to save them a lot of work and do a better job if you could have access to the source code.
If you know that you can't make the client happy with what they have given you, though, it would be doing everyone a disservice for you to try.

If you absolutely can't convince them to give you access to the source, then this client sounds stupid:
He has a layout which is table based.
He wants you to magically make it look better with CSS, without having access to the source.
"@Phoenix I don't see any classes or IDs." - there are no classes or ids to hook into.
You might be able to do it if you used some CSS3 selectors to, for example, select the 3rd td inside a td inside the 2nd table to apply styles to ;)
But, that won't help if you have to support older browsers, which makes this impossible at the moment without doing something differently.
I don't have full knowledge of your situation, but here's what I would probably do (if I couldn't convince them to give me access to the source):
Open the live site.
Copy the HTML source code.
Paste it into a new local file.
Add this into the <head> section: <base href="http://the-clients-site.com/" />.
This will let all the assets on the page load from the client's actual site.
Now, you have something to work with.
You have to keep track of ALL changes you make to the file.
The first change should be adding your own blank style tag.
Then, you can add id and class to whichever elements you feel need it.
You should try to avoid moving around elements, unless it's absolutely required. Those changes are a whole lot harder to explain to someone. I know from experience.
You should be able to style the page properly now.
Then, you deliver the completed page, and the documented list of changes you had to make to the HTML (add id, here add class there).
The client should then be able to integrate the changes into his site.
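The "add a <base> tag" step from the list above can also be scripted. The following is a sketch under assumptions: addBaseHref is a hypothetical helper, the HTML is assumed to be non-empty, and character encoding is left to whatever the page itself declares.

```php
<?php
// Hypothetical helper: inserts <base href="..."> as the first child of <head>.
function addBaseHref(string $html, string $baseUrl): string
{
    $previous = libxml_use_internal_errors(true); // table-era markup is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $head = $doc->getElementsByTagName('head')->item(0);
    if ($head === null) {
        // The page had no <head>; create one at the top of <html>.
        $head = $doc->createElement('head');
        $doc->documentElement->insertBefore($head, $doc->documentElement->firstChild);
    }

    $base = $doc->createElement('base');
    $base->setAttribute('href', $baseUrl);
    // Put it first so every relative URL after it resolves against the live site.
    $head->insertBefore($base, $head->firstChild);

    return $doc->saveHTML();
}
```

For example, the copied source could be passed through addBaseHref($source, 'http://the-clients-site.com/') and saved as a local file to style against.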

Well, at a bare minimum they'll need to modify their PHP to reference your CSS. More importantly, you need to be able to hook your CSS up to elements. Do the tables/rows/etc. have IDs or classes attached?
If they are clever and have some good separation between code and presentation (using a templating engine or similar) then you can probably just edit the template / css.
If they won't let you edit the PHP and you come up with a new awesome layout, they will have a nightmare job trying to integrate it and probably won't bother.

I don't see the problem. You can style tables just as easily as divs. You don't have to know how the wall is built to know how to paint it, which is pretty much all you've been hired to do. Only problem I could see would be if they haven't added any classes or ids to the elements yet. After all, what the browser/client sees is the only thing that needs styling, and since you can see everything that the browser sees, you can see everything that needs styling.
If they have added classes/ids, then just take a copy of a page and style it in a testing area. Once it looks nice, take a copy of another page and make sure it looks nice with the same CSS, adding rules for any unstyled elements that didn't exist on the first page. Repeat this page by page until you are satisfied that every page, within reason, would look nice with it.
If they haven't added classes/ids, tell them they need to in some capacity before you can work on it, perhaps provide some guidance on the issue.

I'm actually doing this right now for SO.
I'm working on a userscript that provides an alternate "clean" stylesheet for the StackExchange network. I have no access to the SO engine. I am using the Chrome Inspector to look at how the elements are set up. I recommend the same. (Although it is a little different, since I'm modifying the original CSS file.)
You can easily identify what you want to style with the Inspector and then work from there. I would suggest that you ask your client for a list of classes and IDs though. (I got that in the form of an existing stylesheet, you can go about it in a different way, if that suits you and your client.)

Very basic HTML/scripting/active page question

A friend has asked me for help with her website design. Although I know a fair amount about the basics behind HTML, XML, PHP, ASP.NET, JavaScript, etc., I'm not really comfortable sitting down and coding from scratch. All of the work I do is in Java, C++, and so on.
My friend would like to add a vertically scrolling marquee to her site - no problem, there is code for that all over the internet. Here is the tricky part - she would like the text to be dynamically pulled from another website. This isn't like a simple text file, either - it's a list of names from a specific blog post, so there would be a lot of text processing involved to wade through all of the other markup, and extract the relevant info.
The way I see it, here are her options -
1) Write some kind of a perl script or somesuch that is set to run daily. This script will visit the blog and extract the necessary info. It will then update the HTML file's marquee text with its new info.
2) Some sort of active page written in ASP or PHP that will dynamically build the marquee (and the rest of the site) each time the site is visited, basically doing the work of the perl script each time. This seems like it has the potential to be somewhat slow.
Per my understanding, those are her only options. Am I correct? Is there no simple way to do this in JavaScript that I am just missing? I know you can reference an image to be dynamically pulled with the marquee, but this isn't that simple...
Thanks.
EDIT: I guess where I was going with my question was this: Unless I implement this statically, this is going to be fairly involved, right? I believe it is over my head. This is why I would like to simply copy/paste the text list into the html document. It would need to be updated every time the blog does, but that only appears to happen every few months, so that's not a large chore. I realize this is a lazy solution, but this is from someone very inexperienced in web development.
For reference, this is the SPECIFIC blog post which the text will come from, and my friend would ONLY like to display that list of names that begins when you scroll several paragraphs down.
http://truthnottasers.blogspot.com/2008/04/what-follows-are-names-where-known.html

It depends on what the list of names looks like, i.e. how much intelligence is needed to parse it. But this is something that could fairly easily be pulled, parsed and displayed using Ajax, for example in the jQuery flavour.

All the blogs I have ever seen have an RSS feed. Why not just grab the feed?... Google provides javascript that does only this.
Google Ajax Feed API

The RSS suggestion sounds good. If you can't get it in the RSS you could screen scrape the content.
If you could do it with JavaScript, I think it would suffer from the same resource issues as your once-a-day Perl script and per-load ASP/PHP methods, since it would still have to fetch the web content by making a call to the website.
Another option is to use ASP.NET and enable caching, so that when other visitors come to the site it serves up the cached page instead of fetching the content all over again. You can set this to cache for 24 hours or so. I'm sure other server languages have similar features. Basically this would be the same as your once-a-day Perl method, but kept within a web framework.
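The same caching idea carries over to PHP. Below is a minimal file-based sketch, assuming a hypothetical cachedFetch helper and a caller-supplied callback that does the actual download and parsing of the blog post; neither name comes from a real framework.

```php
<?php
// Hypothetical helper: returns the cached copy while it is fresh, otherwise refetches.
function cachedFetch(string $cacheFile, int $maxAgeSeconds, callable $fetch): string
{
    clearstatcache(true, $cacheFile); // make sure filemtime() is not a stale value

    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached; // still fresh: no call to the remote site
        }
    }

    // Cache is missing or too old: do the expensive work once and store the result.
    $fresh = (string) $fetch();
    file_put_contents($cacheFile, $fresh, LOCK_EX);
    return $fresh;
}
```

With a maximum age of 86400 seconds, the first visitor of the day pays for the fetch and everyone else gets the stored list, which matches the once-a-day behaviour described above.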
Another hacky solution would be to use an iframe and frame the content with JavaScript so that it only shows the content you want to show. Of course, you'll have no control over the formatting (background, fonts) of the iframe, and if the content gets bigger or changes position you'll have problems.