I am familiar with scraping and using XPath in PHP to parse the DOM and pull what I want from a page. What I would like to hear are some suggestions on how I could programmatically ignore the header, footer and sidebars on a page, and only extract the main body content.
The situation is that there is no specific target site, so I cannot simply ignore specific IDs like #header and #footer, because every page is written slightly differently.
I know that Google does this, so I know it must be possible; I just don't really know where to start with it.
Thanks!
There is no definitive way to determine it, but you can get reasonable results with heuristic methods. A suggestion:
Scrape two or more pages from the same website and compare them block by block, starting at the top level and going a few levels deep until the blocks are sufficiently equal. The comparison would not be == but a similarity index, for example with similar_text().
Blocks above a certain percentage of similarity will most likely be the header, footer or menu. You will have to find out by experiment which threshold is useful; a sketch of the idea follows.
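A minimal sketch of that comparison, assuming the two pages are already loaded into DOMDocument objects (the function name and the 90% threshold are my own illustrative choices, not tested values):

<?php
// Flag blocks that are nearly identical on two pages of the same site;
// such blocks are most likely header, footer or menu markup.
function likelyBoilerplateBlocks(DOMDocument $pageA, DOMDocument $pageB, $threshold = 90.0)
{
    $boilerplate = array();
    foreach ($pageA->getElementsByTagName('div') as $a) {
        foreach ($pageB->getElementsByTagName('div') as $b) {
            similar_text($a->textContent, $b->textContent, $percent);
            if ($percent >= $threshold) {
                $boilerplate[] = $a;   // block appears almost unchanged on both pages
                break;
            }
        }
    }
    return $boilerplate;
}

$pageA = new DOMDocument(); @$pageA->loadHTMLFile('http://example.com/post-1');
$pageB = new DOMDocument(); @$pageB->loadHTMLFile('http://example.com/post-2');
$repeatedBlocks = likelyBoilerplateBlocks($pageA, $pageB);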
There is no small or quick way to scrape content from a web page. I have done a lot of these, and there is no simple rule. In the earlier HTML3/table-based design days there were different ways to identify content, and site design itself was limited: screen sizes were small, so the menu usually sat along the top and there was no room for left or right panels. Then came the era of panels built with table layouts, and now we have floating content. On top of that, overflow:hidden is commonly used, so it is even harder to identify the body by word count, etc.
When an HTML file is written, the code is never tagged as "content" or "menu". You can sometimes derive that from class names, but it is not universal. The content gets its size and position from CSS, so your parser alone can never determine the body part of the page. If you use an embedded HTML viewer and use DHTML/JS to measure the blocks after rendering, there might be a way to do it, but it will still never be universal. My suggestion is to build your parser and improve it case by case.
As for Google, it has built programs for most combinations of HTML designs. But even for Google, I think a truly universal parser is impossible.
I am working on an applet that allows the user to input a URL to a news article or other webpage (in Japanese) and view the contents of that page within an iFrame on my page. The idea is that once the content is loaded into the page, the user can highlight words with their cursor, which stores the selected text in an array (for translating/adding to a personal dictionary of terms) and surrounds the text with a red box (div) according to a stylesheet defined on my domain. To do this, I use cURL to retrieve the HTML of the external page and dump it into the source of the iFrame.
However, I keep running into major formatting problems with the retrieved HTML. The big problem is preserving style sheets, and to fix this I've used DOMDocument to add tags to the <head> section of the retrieved HTML. This works for some pages/URLs, but there are still lots of style problems with the output HTML for many others: div layers crash into each other, alignments are off, and backgrounds are missing. This is made a bit more problematic because I need to embed the output HTML in a new <div> in order to make the onClick JavaScript function for passing text selections in the embedded content work, which means the resulting source ends up looking like this:
<div onclick="parent.selectionFunction()" id ="studyContentn">
<!-- HTML of output from cURL, including doctype declarations and <html>,<head> tags -->
</div>
It seems like most of the formatting issues I keep running into are largely arbitrary. I've tried using PHP Tidy to clean the retrieved HTML, but that also only works for some pages and not many others. I have a slight suspicion it may have to do with CDATA declarations that get parsed oddly by DOMDocument, but I am not certain.
Is there a way I can guarantee that the HTML output from cURL will be rendered correctly and faithfully in all instances? Or is there perhaps a better way of going about this? I've tried a bunch of different approaches, and each gets closer to a solution but brings its own new problems as well.
Thanks -- let me know if I can clarify anything.
If I understand correctly, you are trying to pull the HTML of a complete web page and display it under your domain, in your own HTML. This is always going to be tricky: a lot of JavaScript will break, relative URLs will be wrong and, as you mentioned, so will the styles. You're probably also changing the dimensions the page is displayed in. These can all be worked around, but you're going to be fighting an uphill battle with each new site, or whenever a current site changes its design.
I'd probably take a different approach to the problem. You might want to write a browser plugin as the interface to the external website instead. Then your applet can sit on top of the functional and tested (hopefully) site, and you can focus on what you need to do for your applet rather than a never-ending list of fiddly HTML issues.
I am trying to do a similar thing. It is very difficult to preserve the formatting, and the JS scripts in the webpage complicate things further. I finally gave up on the idea of completely reproducing the original format and settled for a workaround:
Select only the headers, links, lists and paragraphs you are interested in.
Add the domain path of your own site to the links.
Wrap the headers, links and other items in your own classes.
Display it.
In your case you also want to select text and store it, which is another topic. What I did was parse the HTML at two levels, which makes the selection easy to do; a rough sketch of that extraction step is below. Keep in mind that IE and Firefox/Chrome need to be handled separately.
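A rough sketch of that kind of selective extraction, assuming $html already holds the remote page fetched with cURL; the domain 'http://mysite.example' and the class name 'studyItem' are placeholders:

<?php
$doc = new DOMDocument();
@$doc->loadHTML($html);                      // @ suppresses warnings from malformed remote markup
$xpath = new DOMXPath($doc);

$out = new DOMDocument();
$wrapper = $out->createElement('div');
$out->appendChild($wrapper);

// Keep only headers, links, list items and paragraphs from the original page.
foreach ($xpath->query('//h1|//h2|//h3|//a|//li|//p') as $node) {
    $copy = $out->importNode($node, true);

    // Rewrite links so they point back through my own site.
    if ($copy->nodeName === 'a' && $copy->hasAttribute('href')) {
        $copy->setAttribute('href', 'http://mysite.example/view?url=' . urlencode($copy->getAttribute('href')));
    }

    // Wrap every kept element in my own class so my stylesheet applies.
    $copy->setAttribute('class', 'studyItem');
    $wrapper->appendChild($copy);
}

echo $out->saveHTML();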
I'm fairly new to web-centric design and programming.
I've got an HTML + CSS with PHP page that I'm quite happy with. It's got a header, a main content area, and a sidebar.
Now I'm working on my second page. The second page should have the same look as the first page. I'll reuse the CSS, but there seems to be a lot of repetition between the first and second pages (the content in the header, and the sidebar, for example, is almost identical).
Is it normal to repeat things over multiple pages? If, later, I want to change something, I'm going to have to change it in (potentially) numerous places; that seems rather silly, so I presume I'm missing something.
I thought perhaps I'd use the "small parts" from my CSS in a "larger" wrapper, encompassing the entire Header, perhaps, and then include that in both pages; I'm not sure if that's the right direction I should be heading (or how I'd do it).
I also thought perhaps I could use PHP to dynamically generate the page each time, wrap the generation in a class, and end up with something like myClass->generateHeader(). I'm using PHP to generate some of the page anyway, so the conceptual leap isn't too great; on the other hand, I imagine that generating the page on each request is worse in terms of performance, and (from my brief searching) it seems to involve several hundred lines of PHP to generate a rather short stretch of HTML (assuming it's anything more complicated than a bunch of echo statements containing the HTML I'd have written anyway).
Searching for "creating HTML templates" is rather fruitless, and I'm not sure what keywords I should be using to ask how this is normally handled.
How do you adhere to DRY and avoid repeating yourself over several related pages in a website?
You can use the PHP include statement, http://php.net/manual/en/function.include.php, so that you don't have to repeat the parts of your page that are always needed, for example the header, footer, navigation, etc.
To answer your other question, using a class to store HTML sections is another way to go and can prove useful. It also won't add much extra processing time unless your class needs to do a lot of calculations on initialization.
A common practice is to separate out a header file and a footer file and include them. A good book on this is Larry Ullman's PHP and MySQL for Dynamic Web Sites.
But for a quick overview, go to: http://www.davidjrush.com/blog/2009/08/php-header-and-footer-templates/
If you fear that some things will repeat across multiple resources, keep them out of the individual pages and add them in post-processing instead.
You can do this best at the server level. You can also add caching at that level if needed, so your page only gets generated once (or regenerated when the cache expires).
I agree with Matt K that this is probably more of a Programmers Stack Exchange question, but I'll provide some tips anyway.
I think the normal thing is to create some kind of header and footer files. Your header file would include everything you want on every page, e.g. the logo, the menu, CSS includes, etc. The footer is useful for closing wrapper divs, Google AdSense code, etc.
Once these files have been created, for each page you just do:
<?php include("header.php"); ?>
BODY OF PAGE
<?php include("footer.php"); ?>
:)
If any of the components are static components (e.g. the sidebar), then you can put the static HTML in a file and simply include it in the relevant place. (OO alternative: have a View object pull in the static HTML for you.)
If you need some custom logic in these components, then an include could still work, but since you discuss a class-based alternative I'd suggest building an MVC architecture into your application.
An MVC framework would probably consider the header/sidebar/footer etc. as partial views (smaller components in the main view), or part of your overall layout (your header/sidebar/footer wrap around your main content body).
The layout option makes a lot of sense as it decouples the view for the main content from your overall idea of how the page components stick together. It also means it's really easy to modify the layout (for instance, put the sidebar on the right instead of the left by changing one layout file).
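As a rough illustration of the layout idea (not any particular framework's API; the render() helper and the view file names are placeholders), the page-specific view can be rendered first and then handed to a layout that supplies the header, sidebar and footer:

<?php
// Minimal layout rendering helper (illustrative only).
function render($view, array $data = array())
{
    extract($data);                          // expose $data keys as local variables
    ob_start();
    include __DIR__ . "/views/$view.php";
    return ob_get_clean();
}

// Render the page-specific content, then wrap it in the shared layout.
$content = render('article', array('title' => 'My second page'));
echo render('layout', array('content' => $content));

// views/layout.php contains the shared header, sidebar and footer,
// and simply echoes $content where the main body belongs.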
I have some random HTML layouts that contain important text I would like to extract. I cannot just strip_tags() as that will leave a bunch of extra junk from the sidebar/footer/header/etc.
I found a method built in Python and I was wondering if there is anything like this in PHP.
The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn't a novel idea, but it works!) The basic process works as follows:
Parse the HTML code and keep track of the number of bytes processed.
Store the text output on a per-line, or per-paragraph basis.
Associate with each text line the number of bytes of HTML required to describe it.
Compute the text density of each line by calculating the ratio of text to bytes.
Then decide if the line is part of the content by using a neural network.
You can get pretty good results just by checking if the line's density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning - not to mention that it's easier to implement!
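A very rough PHP sketch of the density idea (the per-line splitting and the use of the average as a threshold are my own simplifications; the original approach goes further with a neural network):

<?php
// Score each line by the ratio of visible text to raw HTML bytes and keep
// the lines whose density is at or above the document average.
function extractByDensity($html, $factor = 1.0)
{
    $scored = array();
    foreach (preg_split('/\r\n|\r|\n/', $html) as $line) {
        $text  = trim(strip_tags($line));
        $bytes = strlen($line);
        if ($bytes === 0 || $text === '') {
            continue;
        }
        $scored[] = array('text' => $text, 'density' => strlen($text) / $bytes);
    }
    if (empty($scored)) {
        return array();
    }

    $average = array_sum(array_column($scored, 'density')) / count($scored);

    $content = array();
    foreach ($scored as $row) {
        if ($row['density'] >= $average * $factor) {
            $content[] = $row['text'];
        }
    }
    return $content;
}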
Update: I started a bounty for an answer that can pull the main content from a random HTML template. Since I can't share the documents I will be using, just pick any random blog site and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may contain text too. See the link above for ideas.
phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.
UPDATE 2
DEMO: http://so.lucafilosofi.com/find-important-text-in-arbitrary-html-using-php/
tested on a casual list of blogs taken from the Technorati Top 100 and Best Blogs of 2010
many blogs make use of a CMS;
a blog's HTML structure is almost always the same.
avoid common selectors like #sidebar, #header, #footer, #comments, etc.
avoid any widget by tag name (script, iframe)
clear out well-known boilerplate content with patterns like the following (a small preg_replace sketch follows the list):
/\d+\scomment(?:[s])/im
/(read the rest|read more).*/im
/(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
/[^a-z0-9]+/im
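A minimal sketch of that cleanup pass, applying the patterns above to text that has already been pulled out of the page ($extractedText is a placeholder):

<?php
// Remove comment counts, "read more" stubs and byline/timestamp noise.
$noise = array(
    '/\d+\scomment(?:[s])/im',
    '/(read the rest|read more).*/im',
    '/(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im',
);
$cleanText = preg_replace($noise, '', $extractedText);

// The last pattern, /[^a-z0-9]+/im, is better used to normalise text into tokens
// (for comparing blocks), since applied to full prose it would eat the spaces too.
$tokens = preg_split('/[^a-z0-9]+/i', $cleanText, -1, PREG_SPLIT_NO_EMPTY);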
search for well-known classes and IDs:
typepad.com .entry-content
wordpress.org .post-entry .entry .post
movabletype.com .post
blogger.com .post-body .entry-content
drupal.com .content
tumblr.com .post
squarespace.com .journal-entry-text
expressionengine.com .entry
gawker.com .post-body
Ref: The blog platforms of choice among the top 100 blogs
$selectors = array('.post-body','.post','.journal-entry-text','.entry-content','.content');
$doc = phpQuery::newDocumentFile('http://blog.com')->find($selectors)->children('p,div');
search based on a common HTML structure that looks like this:
<div>
<h1|h2|h3|h4|a />
<p|div />
</div>
$doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div');
DOMDocument can be used to parse HTML documents, which can then be queried from PHP.
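For example (a minimal sketch; the URL and the XPath expression are placeholders, not a general-purpose rule):

<?php
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://example.com/article');   // @ silences warnings on malformed HTML

$xpath = new DOMXPath($doc);

// Grab every paragraph that is not inside an element whose id suggests boilerplate.
$nodes = $xpath->query('//p[not(ancestor::*[@id="header" or @id="footer" or @id="sidebar"])]');

foreach ($nodes as $p) {
    echo trim($p->textContent), "\n";
}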
I worked on a similar project a while back. It's not as complex as the Python script, but it will do a good job. Check out the PHP Simple HTML DOM Parser:
http://simplehtmldom.sourceforge.net/
Depending on your HTML structure, and if you have IDs or classes in place, you can get a little more specific and use preg_match() to grab any information between a certain start and end tag. This means you should know how to write regular expressions.
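A small sketch of that approach (the id value "article-body" is a placeholder, and regexes on HTML are fragile, so treat it as a quick-and-dirty option):

<?php
// Grab everything between a known opening and closing marker, non-greedily.
// The "s" modifier lets "." span newlines; note this stops at the first </div>,
// so it will break on nested divs.
if (preg_match('#<div id="article-body">(.*?)</div>#s', $html, $m)) {
    $articleText = trim(strip_tags($m[1]));
    echo $articleText;
}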
You can also look into a browser emulation PHP class. I've done this for page scraping and it works well enough, depending on how well-formed the DOM is. I personally like SimpleBrowser:
http://www.simpletest.org/api/SimpleTest/WebTester/SimpleBrowser.html
I have developed an HTML parser and filter PHP package that can be used for that purpose.
It consists of a set of classes that can be chained together to perform a series of parsing, filtering and transformation operations in HTML/XML code.
It was meant to deal with real-world pages, so it can handle malformed tags and data structures, and it preserves as much of the original document as possible.
One of the filter classes it comes with can do DTD validation. Another can discard insecure HTML tags and CSS to prevent XSS attacks. Another can simply extract all document links.
All those filter classes are optional. You can chain them together the way you want, if you need any at all.
So, to solve your problem, I do not think there is a ready-made solution for that in PHP anywhere, but a special filter class could be developed for it. Take a look at the package; it is thoroughly documented.
If you need help, just check my profile and mail me, and I may even develop the filter that does exactly what you need, perhaps inspired by solutions that exist for other languages.
I'm trying to write a text parser with PHP, like Instapaper did. What I want to do is get a webpage and parse it in text-only mode.
It's simple to get the webpage with cURL and strip the HTML tags. But every webpage has some common areas, like the header, navigation, sidebar, footer, banners, etc. I only want to get the article in text mode and exclude all the other parts. It's also simple to exclude those parts if I know their "id" or "class", but I'm trying to automate this process and apply it to any page, like Instapaper does.
I get all the content between the <body> tags, but I don't know how to exclude the header, sidebar or footer and get only the main article body. I need to develop some logic to get only the main article part.
It's not important for me to find the exact code; it would also be useful just to understand how to exclude the unnecessary parts, as I can then try to write my own code in PHP. Examples in other languages would be useful as well.
Thanks for helping.
You might try looking at the algorithms behind the readability bookmarklet - it has a decent success rate at extracting content from all the surrounding web page rubbish.
A friend of mine made it, which is why I'm recommending it - I know it works, and I'm aware of the many techniques he uses to parse the data. You could apply these techniques to what you're asking.
You can take a look at the source of Goose -> it already does a lot of this, like Instapaper-style text extraction.
https://github.com/jiminoc/goose/wiki
Have a look at the ExtractContent code from Shuyo Nakatani.
See original Ruby source http://rubyforge.org/projects/extractcontent/ or a port of it to Perl http://metacpan.org/pod/HTML::ExtractContent
You really should consider using an HTML parser for this. Gather similar pages and compare the DOM trees to find the differing nodes; a rough sketch of that comparison is below.
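A rough sketch of that idea, assuming two similar pages from the same site (the URLs, the leaf-node selection and the plain text comparison are my own simplifications):

<?php
// Map each leaf element's node path to its text content.
function textByPath(DOMDocument $doc)
{
    $map = array();
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//body//*[not(*)]') as $node) {   // leaf elements only
        $map[$node->getNodePath()] = trim($node->textContent);
    }
    return $map;
}

$a = new DOMDocument(); @$a->loadHTMLFile('http://example.com/post-1');
$b = new DOMDocument(); @$b->loadHTMLFile('http://example.com/post-2');

$mapA = textByPath($a);
$mapB = textByPath($b);

// Nodes whose path and text are identical on both pages are probably header,
// footer or sidebar; the ones that differ are candidate article content.
$candidateContent = array();
foreach ($mapA as $path => $text) {
    if ($text !== '' && (!isset($mapB[$path]) || $mapB[$path] !== $text)) {
        $candidateContent[] = $text;
    }
}
print_r($candidateContent);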
This article provides a comparison of different approaches. The Java library boilerpipe was rated highly, and on the boilerpipe site you will find the author's scientific paper, which compares it to other algorithms.
Not all algorithms suit all purposes. The biggest application of such tools is simply getting raw text to index for a search engine, the idea being that you don't want search results to be polluted by adverts. Such extraction can be destructive, meaning it won't give you "the best reading area", which is what people want from Instapaper or Readability.
I'm working on an information warehousing site for HIV prevention. Lots of collaborators will be posting articles via a tinyMCE GUI.
The graphic designers, of course, want control over page lengths. They would like automatic pagination based on the height of content in the page.
Anyone seen AJAX code to manage this?
Barring that, has anyone seen PHP code that can do a character count and a look-behind regex to avoid splitting words or tags?
Any links much appreciated!
If it doesn't need to be exact, there's no reason you can't use a simple word-count function to determine an appropriate place to break the page (at the nearest paragraph, I suppose). You could go so far as to reduce the words per page based on whether there are images in the post, even taking the size of the images into account.
That could get ugly fast, though. I think the best way to do it is to let the authors manually set the page dividers with a tag in the article that you can parse out. Something like [pagebreak] is pretty straightforward, and you'll get much more logical and readable page breaks than any automated solution would achieve; a quick sketch of that follows.
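A quick sketch of the [pagebreak] idea ($articleHtml and the ?page= query parameter are placeholders):

<?php
// Split the stored article on the manual marker and show one chunk per request.
$pages = explode('[pagebreak]', $articleHtml);

$current = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
$current = min($current, count($pages));

echo $pages[$current - 1];

// Simple prev/next links to move between the chunks.
if ($current > 1) {
    echo '<a href="?page=' . ($current - 1) . '">Previous</a> ';
}
if ($current < count($pages)) {
    echo '<a href="?page=' . ($current + 1) . '">Next</a>';
}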
You don't just have to worry about character count; you also have to worry about image heights, if there are images or any other kind of embedded objects in your pages that take up height. A character count also won't tell you anything about paragraph structure (a page with a single long paragraph can render shorter than a page with many paragraphs, even if it has more characters).
If you're willing to use JavaScript, that might be the ideal solution: send the entire article to the client and let JavaScript handle the pagination. On the client you can detect image and object heights. You could use PHP to place markers where you think the pages should break, and then use JavaScript to make it happen. Unless the pages are very long, I don't think you'll need several XMLHttpRequests (AJAX).
A straight PHP solution is also simple, but probably not ideal, since this isn't a matter of managing row counts. You could use a GET variable to determine where you are in the page.
Although this might not be the exact answer you're looking for, you should really make sure your site doesn't have a fixed height. Flexible widths are really nice, but not as critical as the height.
Especially for a cause like this, and a content-heavy site, it's fair to require flexible heights.
As mentioned by apphacker, you can't really detect the height from within PHP, so you're kind of stuck with JavaScript. If you're absolutely stuck with paging, it's probably better to let your content authors decide when to break the page, so that you break it at a real section instead of mid-word, mid-sentence, etc.
Edit: usability should dictate design, not the other way around. You're doing it wrong ;)
Good pagination is not a simple task, and it's not just a matter of coding: scientific research by Plass (1981) proved that optimal page breaking is in general NP-hard.
You have to worry about floating figures, line breaks, different font styles, etc.
And the only thing an HTML engine can help you with is parsing a page into a DOM tree. What about sizes? Yes, you can get font widths and heights, margins and paddings, picture sizes. But that's all; the whole layout is on your shoulders. And doing it in JavaScript... meh.
So the only feasible solution for automatic fixed-height pagination would be server-side. PrinceXML is currently the best HTML-to-PDF converter, but it costs a lot.
If you are fine with different page heights, you could use epalla's suggestion. But that is also not as simple as it seems.
Some references for pagination:
Optimal pagination techniques, Plass, 1981
On the Pagination of Complex Documents, 1998
Pagination reconsidered
Knuth's Digital Typography