Extracting a portion of an HTML page - PHP

Is it possible to extract a portion of a remote HTML page and print it on another page, using PHP cURL, an HTML DOM parser, or any other method, while preserving the original formatting, styles, images, and tab functionality?
For example, how would I extract the content of the central column (with tabs and formatting, preserving the look of the original text) from http://ru.yahoo.com/?
As far as I understand, the script should also process the external CSS so that the returned content has the same look as the original. What would be the most appropriate way, if that's possible at all? If so, an example would be highly appreciated. I looked at several examples but didn't find a solution for my case.

Well, if I had to do it quickly (read: very dirty), I think I would do this:
Pull the HTML from the remote server using standard PHP
Use the HTML that you took from the other site and add your own HTML to it down at the bottom.
Also add your own CSS to hide the parts of the other site's HTML you don't want visible, and to style your own HTML.
Fiddle until it looks okay enough; a minimal sketch of this follows below. However, I think this will break the loading of the external JS files because of the same-origin policy.
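Something like this minimal sketch, assuming allow_url_fopen is enabled; the URL and the selectors being hidden are placeholders, not taken from any real site:

<?php
// Dirty approach: pull the remote page, hide what we don't want with
// our own CSS, and bolt our own markup onto the bottom.
// 'http://example.com/' and the selectors below are placeholders.
$html = file_get_contents('http://example.com/');

$overlay = <<<HTML
<style>
/* hide the remote site's chrome (selectors are guesses) */
#header, #left-col, #right-col { display: none; }
/* style our own addition */
#my-footer { margin: 1em; font-weight: bold; }
</style>
<div id="my-footer">My own content goes here.</div>
HTML;

// Inject just before </body>; crude, but that's the point here.
echo str_replace('</body>', $overlay . '</body>', $html);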
A nice approach would be this:
Pull the HTML from the remote server using standard PHP
Parse the HTML with some PHP HTML parser, strip out all external CSS and JS files, and pull those files as well.
Use XPath to extract the parts that you need (a sketch follows after this list).
Create a new HTML document with your own HTML, the parts that you need, and new links to your newly downloaded CSS and JS files. Also add your own CSS and JS to style the result.
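A rough sketch of the fetch-and-extract part (steps 1 and 3), using cURL and DOMXPath; the URL and the XPath query are assumptions to adapt to the actual markup:

<?php
// Step 1: pull the HTML with cURL. The URL is a placeholder.
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Step 3: cut out the parts you need with XPath. Real-world HTML is
// rarely valid, so silence libxml's complaints while parsing.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// '//div[@id="center"]' is an assumed selector for the central column
foreach ($xpath->query('//div[@id="center"]') as $node) {
    echo $doc->saveHTML($node);
}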
You know, RSS was invented for this, and if they don't provide an RSS feed, they most likely don't want you to take the content and post it on your own site. :P

Related

How to edit a PHP/HTML file using PHP

I'm trying to design an interface to change <img> sources throughout my website, so that I don't need to change the sources by hand anytime we want to change what images my site displays.
I looked at DOMDocument already, and while it's useful for accessing the different elements in a file, it wraps the files in its own HTML tags, which messes up the files (I already tried looking here, but those solutions didn't work for me).
Is there a better way to take a file, retrieve element information and attributes (like src and id) from img tags, edit those elements, and save the file?
Just to close this up - I ended up using QueryPath to accomplish this task. It works pretty well and was pretty simple to utilize.
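For reference, here is a sketch of the plain-DOMDocument route that avoids the wrapper-tag problem: with libxml 2.7.8+ (and PHP 5.4+), loadHTML() accepts options that suppress the implied <html><body> wrapper and the default doctype. The file name, the img id, and the new path are made up for illustration:

<?php
// Rewrite <img> sources in place without DOMDocument adding its own
// <html><body> wrapper or doctype around the file's contents.
$file = 'page.html';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML(file_get_contents($file),
               LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('img') as $img) {
    if ($img->getAttribute('id') === 'banner') {        // hypothetical id
        $img->setAttribute('src', '/images/new-banner.png');
    }
}

file_put_contents($file, $doc->saveHTML());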

Advanced ePub reader

I'm trying to build an advanced ePub reader using jQuery and PHP/Zend Framework 1.12 (for the ePub 3.0 format). The reader should contain the following features:
books should be displayed using pages (2 pages at a time)
the user should be able to navigate between pages and chapters using a slider
the user can create highlights and bookmark pages
the reader must be cross-browser (I don't care much about older versions of IE, but it must work on Safari, Mozilla, Chrome)
My idea is to make some kind of PHP parser that will handle the epub content and pass it on to the Javascript code in a more 'friendly' format, but I haven't worked with epubs before and I'm not sure where to start.
Here are a few questions that I have been struggling with:
The first problem I have encountered is how to extract the content from an .ePub archive and render it in such a way that allows the paginated view. What PHP library would you recommend for parsing ePubs? I have already tested some libraries like BookGluttonEpub (seems quite old) and EPUBParser (difficult to understand since there are no examples and docs). Are there others I missed?
Should I clean the html code (like remove invalid tags for example) before passing it to the reader?
What do you consider is the best way to display the pages? Should I use CSS and the 'column' property? Or should I make a more advanced script that will split the html content of a chapter into pages?
Thanks
First, extract the .epub file; it is just a ZIP archive, so you can use PHP's zip library, and you don't need to parse the HTML or CSS to unpack it. You can create your reader using HTML5 canvas and CSS3 properties.
I think the better option is to use HTML5 and CSS3 if you are not worried about IE compatibility.
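A minimal sketch of that extraction step, assuming PHP's ZipArchive extension is available; 'book.epub' and '/tmp/book' are placeholder paths, and the container.xml lookup shows where the chapter documents are listed:

<?php
// An .epub is just a ZIP archive, so ZipArchive can unpack it.
$zip = new ZipArchive();
if ($zip->open('book.epub') === true) {
    $zip->extractTo('/tmp/book');
    $zip->close();
}

// Every valid ePub ships META-INF/container.xml, which names the OPF
// package file; that file in turn lists the chapter (X)HTML documents.
$container = new DOMDocument();
$container->load('/tmp/book/META-INF/container.xml');
$rootfile = $container->getElementsByTagName('rootfile')
                      ->item(0)
                      ->getAttribute('full-path');
echo "Package file: $rootfile\n";   // e.g. OEBPS/content.opf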

Embed a portion of a website based on CSS ID tag

There is a portion of a website that I would like to include in my website. I could attempt to use an iframe; however, I believe this would not be the best approach. Is there a way to embed a portion of a website based on the CSS ID?
Is there a way to embed a portion of a website based on the CSS ID?
Theoretically, yes: you could fetch the external page using PHP, separate the element using an HTML parser, and show only the element's HTML.
However, you would lose all styling information using this approach, because the HTML will be rendered in your local page's context. Mixing your own and an external site's styles will often lead to chaos.
Unless you really need just the pure HTML markup from the external site, you may be best off with an iframe.
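A minimal sketch of that theoretical route, with a placeholder URL and element id; as noted above, only the bare markup comes along, not the styling:

<?php
// Fetch the external page and isolate one element by its id.
$html = file_get_contents('http://example.com/');

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$element = $doc->getElementById('content');   // assumed id
if ($element !== null) {
    echo $doc->saveHTML($element);            // bare markup, no CSS
}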

PHP : Lookup all CSS files for content

My requirement is to look up all the CSS content (external, internal, and inline) from a given URL and search it for some specific CSS content. I am currently using the 'PHP Simple HTML DOM Parser' to look up the HTML. But is there a specific way I can achieve this for CSS, specifically for all types of CSS?
Thanks in advance :)
It is possible to gather all the CSS an HTML document includes (style attributes and style elements) and links to (link elements of type CSS).
However, as far as I know, only some CSS parsers exist in PHP so far (and none with an interface to the CSSOM). So you would be able to gather all the CSS, but you would need to write or integrate a parser for the CSS on your own.
Keep in mind that some sites use things like LESS, which is then turned into CSS via JavaScript, so you might want to integrate a LESS compiler into PHP as well. The same applies to other client-side CSS technologies.
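A rough sketch of the gathering step using plain DOMDocument/DOMXPath instead of Simple HTML DOM; the URL is a placeholder, and relative stylesheet hrefs would still need resolving against it:

<?php
// Collect the three kinds of CSS a page can carry: <style> elements,
// style="" attributes, and external stylesheets.
$url = 'http://example.com/';

$doc = new DOMDocument();
libxml_use_internal_errors(true);    // tolerate real-world markup
$doc->loadHTML(file_get_contents($url));
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$css = array();

// 1. internal: <style> elements
foreach ($xpath->query('//style') as $style) {
    $css[] = $style->textContent;
}

// 2. inline: style="" attributes
foreach ($xpath->query('//*[@style]') as $node) {
    $css[] = $node->getAttribute('style');
}

// 3. external: linked stylesheets (relative hrefs need resolving first)
foreach ($xpath->query('//link[@rel="stylesheet"]/@href') as $href) {
    $css[] = file_get_contents($href->nodeValue);
}

// $css now holds raw CSS text; finding specific rules in it is the
// part that still needs an actual CSS parser.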

Javascript based horizontal Scrolling of a multi-page PDF?

I'm wondering how I can accomplish horizontal scrolling of the pages of a PDF using JavaScript. Is it better to:
Convert the pages of the PDF into HTML files and then click left-right between iframes where src="...each page.html"?
Convert the pages of the PDF into some other HTML element besides iframe (e.g., DIV?) and then click left-right between elements containing the contents of each page.
I'd like to ensure that the PDF's text remains searchable, so I don't want to turn its pages into images. I'm also skeptical of using iframes because of the formatting challenges of having multiple iframes in a single webpage. I've already tested this approach after converting the PDF to HTML using the Linux-based "PDFtoHTML" software, and found that in general it is a suboptimal solution.
It seems like option 2 is the way to go, but I wouldn't know how to programmatically parse a PDF into multiple DIVs. Besides JavaScript, I'm familiar with PHP and Linux but not other languages, if that would be helpful in thinking of solutions.
The PDF plugin intercepts mouse events, so there is no way to control it directly from the browser/JavaScript.
Your other method, converting to HTML, is feasible.
Converting a PDF page to an HTML file is more or less the same thing as "parsing it into a <div>". If you already found a tool that can do it for you ("PDFtoHTML"), just use that, and strip away everything except what's inside the <body> of the .html it outputs; a sketch of that stripping step follows below.
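A rough sketch in PHP, assuming pdftohtml was run so that it emitted one page-N.html file per page; the file names and output path are assumptions about your setup:

<?php
// Merge pdftohtml's per-page output into one document with a <div>
// per page. 'output/page-*.html' matches an assumed naming scheme.
$pages = glob('output/page-*.html');
natsort($pages);                       // page-2 before page-10

$out = '';
$n   = 0;
foreach ($pages as $file) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);  // pdftohtml output can be messy
    $doc->loadHTMLFile($file);
    libxml_clear_errors();

    // Keep only what is inside <body>, wrapped in a per-page <div>.
    $body  = $doc->getElementsByTagName('body')->item(0);
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $out .= '<div class="pdf-page" id="page-' . ++$n . '">'
          . $inner . '</div>';
}

file_put_contents('book.html', $out);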
