Text Parser with PHP, like Instapaper

I'm trying to write a text parser with PHP, like Instapaper did. What I want to do is get a webpage and parse it in text-only mode.
It's simple to get the webpage with cURL and strip the HTML tags. But every webpage has some common areas, like the header, navigation, sidebar, footer, banners, etc. I only want to get the article in text mode and exclude all the other parts. It's also simple to exclude those parts if I know the "id" or "class" info. But I'm trying to automate this process and apply it to any page, like Instapaper does.
I get all the content between the <body> tags, but I don't know how to exclude the header, sidebar or footer and get only the main article body. I have to develop a logic to get only the main article part.
It's not important for me to find the exact code. It would be useful just to understand how to exclude the unnecessary parts, as I can try to write my own PHP code. Examples in other languages would also help.
Thanks for helping.

You might try looking at the algorithms behind this bookmarklet, Readability - it has a decent success rate for extracting content from among all the web page rubbish.
A friend of mine made it, which is why I'm recommending it - I know it works, and I'm aware of the many techniques he uses to parse the data. You could apply those techniques to what you're asking.

You can take a look at the source from Goose -> it already does a lot of this Instapaper-like text extraction
https://github.com/jiminoc/goose/wiki

Have a look at the ExtractContent code from Shuyo Nakatani.
See original Ruby source http://rubyforge.org/projects/extractcontent/ or a port of it to Perl http://metacpan.org/pod/HTML::ExtractContent

You really should consider using an HTML parser for this. Gather similar pages and compare the DOM trees to find the differing nodes.
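A rough sketch of that compare-the-DOM-trees idea in PHP (the URLs are placeholders, and a real run would need more pages and error handling): text that repeats verbatim at the same element path on both pages is treated as template chrome, while text unique to one page is treated as candidate article content.

<?php
// Rough sketch: compare two similar pages from the same site and keep only
// the text that differs between them (likely the article content).

function textByPath(string $html): array {
    libxml_use_internal_errors(true);          // tolerate real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $map = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//body//text()[normalize-space()]') as $node) {
        // Build a coarse path like "html/body/div/p" for each text node.
        $parts = [];
        for ($el = $node->parentNode; $el instanceof DOMElement; $el = $el->parentNode) {
            $parts[] = $el->nodeName;
        }
        $path = implode('/', array_reverse($parts));
        $map[$path] = ($map[$path] ?? '') . ' ' . trim($node->nodeValue);
    }
    return $map;
}

// Placeholder URLs: two article pages that share the same template.
$a = textByPath(file_get_contents('https://example.com/article-1'));
$b = textByPath(file_get_contents('https://example.com/article-2'));

$article = [];
foreach ($a as $path => $text) {
    // Identical text at the same path on both pages is probably navigation,
    // header or footer; text unique to this page is candidate content.
    if (!isset($b[$path]) || $b[$path] !== $text) {
        $article[] = trim($text);
    }
}
echo implode("\n", $article);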

This article provides a comparison of different approaches. The Java library boilerpipe was rated highly, and on the boilerpipe site you will find the scientific paper that compares it to other algorithms.
Not all algorithms suit all purposes. The biggest application of such tools is just getting the raw text for a search engine to index, the idea being that you don't want search results to be messed up by adverts. Such extraction can be destructive, meaning it won't give you "the best reading area", which is what people want with Instapaper or Readability.

Related

How to separate or identify the header/footer/carousel & other parts of any website?

I want to separate out the header/footer/sidebar/carousel of the homepage of any website.
For example, if I enter google.com or alibaba.com or flipkart.com,
I can retrieve that homepage via PHP cURL (some of them are encoded, which we can't handle).
But the question is how to identify those parts? Each platform uses a different programming language.
Is there any API available on the market, free or paid? Is it possible?
Here is what I have tried:
$url = "https://www.google.com";
$homepage = file_get_contents($url);

libxml_use_internal_errors(true);   // real-world HTML triggers lots of parser warnings
$doc = new DOMDocument;
$doc->loadHTML($homepage);

// DOMDocument exposes almost nothing through print_r(); it has to be
// walked via its node API or queried with DOMXPath instead.
echo "<pre>";
print_r($doc);
exit;
This is an example in PHP. I am open to a solution in any language (Java/.NET).
Main question: is it possible or not?
So there would be a REST API like this that gives a response in JSON:
POST api/getWebsiteData
Params : <Website URL>
Sample Response
{
"header" : <html goes here>,
"menu" : <html goes here>,
"footer" : <html goes here>,
.....
....
}
I agree we will not get a 100% solution for this, because some websites' view-source is obfuscated.
The short answer is no, this is not possible.
The longer answer is that you could build something that might satisfy your needs, but I can pretty much guarantee that it won't work on the bulk of the web without lots, and lots, and lots of tweaking. And I mean lots. Like so much work that you become Google.
A web page is really composed of two things, HTML and DOM. HTML is what you get from functions like file_get_contents, and when the browser interprets it, it gets converted into the DOM. Further, once JavaScript gets involved, it can modify the DOM as it pleases. Some web pages have a pretty 1-to-1 HTML-to-DOM mapping on initial load, but others have very minimal HTML and rely on JS to create and manipulate the DOM.
Next, there's CSS and CSSOM, the latter of which is what JS has access to, similar to HTML's DOM. In CSS you can say "put the header at the bottom and the footer at the top". How many people do that? Probably zero; that's just a far-fetched example, but there are plenty of smaller, more nuanced ones. Some people believe that there should only be one header on a site, while others say that headers hold headings. For instance, you can have (and I have seen) headers inside of a footer. (I'm not saying whether I agree or disagree with that, either.) Also, the web is full of HTML with CSS classes like:
<div class="a">...</div>
<div class="b">...</div>
Which one of those is the header and which is the footer? Or, which is the sidebar? Is one possibly a menu? Even better, go to the ReactJS official site and inspect their DOM and you'll see code like this:
<div class="css-1vcfx3l"><h3 class="css-1xm4gxl"></h3><div>
Do those classes make sense to you?
So if you are going down this path, you are going to have to figure out where you are going to start. Do you just want to parse HTML and ignore JS/CSS/DOM/CSSOM? If so, that's generally called screen scraping (or at least it was when I did it a decade ago).
If you want to get more complex, most browsers can be run in "headless mode" and then interacted with. For instance, there's Chromium in headless mode, but I'd really recommend using an abstraction over that such as Symfony's Panther if you are in PHP or Puppeteer if you are in server-side JS. (I know there are dozens of alternatives, anyone reading this, feel free to throw them in the comments.)
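If the headless route sounds appealing, here is a minimal sketch with Symfony Panther, assuming `symfony/panther` is installed via Composer and a local Chrome/Chromedriver is available; it renders the page (JS included) and then pulls whatever lands in <main>:

<?php
// Sketch only: needs `composer require symfony/panther` plus a local
// Chrome/Chromedriver. The page is rendered before parsing.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\Panther\Client;

$client  = Client::createChromeClient();              // headless Chrome
$crawler = $client->request('GET', 'https://example.com');

// Query the rendered DOM, not just the raw HTML that was sent over the wire.
$main = $crawler->filter('main');
echo $main->count() ? $main->text() : "no <main> element found\n";

$client->quit();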
Regardless of simple or complex, you're going to want to write your own rules. A semi-modern site written in the past couple of years has a good chance of having root or near-root <HEADER>, <MAIN> and <FOOTER> tags. If you find those, your general rules will probably be simpler. You stand a good chance of finding <ASIDE> and other semantic HTML5 tags in there, too.
If you don't find those, you might still be able to look at near-root tags for <div class="header"> and similar. You are going to possibly need to handle alternative versions of header, especially across languages (human, not computer, so English, Spanish, etc.).
Using these rules, I think you could generally build something that would parse a good number of sites on the web.
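For illustration, a minimal sketch of those rules with plain DOMDocument/DOMXPath; the tag and class names checked here are just the guesses described above, not a definitive list:

<?php
// Sketch of the rule set: try semantic HTML5 tags first, then fall back to
// common class names. The selectors are guesses, not an exhaustive list.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents('https://example.com'));    // placeholder URL
$xpath = new DOMXPath($doc);

$rules = [
    'header'  => ['//header',  "//div[contains(@class,'header')]"],
    'main'    => ['//main',    "//div[contains(@class,'content') or contains(@class,'main')]"],
    'sidebar' => ['//aside',   "//div[contains(@class,'sidebar')]"],
    'footer'  => ['//footer',  "//div[contains(@class,'footer')]"],
];

$found = [];
foreach ($rules as $part => $queries) {
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes->length > 0) {
            // Keep the raw HTML of the first match for this part of the page.
            $found[$part] = $doc->saveHTML($nodes->item(0));
            break;
        }
    }
}
print_r(array_keys($found));    // which parts the rules managed to identify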
A big word of caution, however: home pages tend to be weird one-offs, because they tend to hold a subset of everything else on the site but no actual content of their own. In that regard, you'll usually still find a header and footer, but inside, almost everything feels like a sidebar or similar.
As to carousels? That one is honestly really tough. Carousels are built with JS, so if you only look at HTML you might only find a <UL> with a bunch of images. Actually, as I write this, I think I would target a <UL> with images and assume it is a carousel. There will definitely be false positives, but it is a pretty common pattern. Not everyone is a <UL> fan, however, so it might just be regular <DIV>s.
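Reusing the $xpath object from the sketch above, that <UL>-with-images guess can be a single XPath query; the threshold of three images is an arbitrary assumption:

// Flag any <ul> holding several <img> tags as a possible carousel.
// The threshold of three images is arbitrary and will produce false positives.
$carousels = $xpath->query('//ul[count(.//img) >= 3]');
echo $carousels->length . " possible carousel(s) found\n";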
I say all this because I have built these in the past, but for very specific sites and for very specific reasons. Writing a general parser that could work everywhere is, as I said at the beginning, a lot of work.
This is a tough one and, unless you're Google, I doubt it will be possible to make a solution that works on more than a couple of websites.
First off, let's start by taking a couple of websites and looking at what they send to the client.
The HTML of a Wikipedia article looks something like this
<h2><span class="mw-headline" id="History">History</span></h2>
<h3><span class="mw-headline" id="Development">Development</span></h3>
<div class="thumb tright"><div class="thumbinner" style="width:172px;"><img alt="Photograph of Tim Berners-Lee in April 2009" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Tim_Berners-Lee_April_2009.jpg/170px-Tim_Berners-Lee_April_2009.jpg" decoding="async" width="170" height="234" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Tim_Berners-Lee_April_2009.jpg/255px-Tim_Berners-Lee_April_2009.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Tim_Berners-Lee_April_2009.jpg/340px-Tim_Berners-Lee_April_2009.jpg 2x" data-file-width="1195" data-file-height="1648" /> <div class="thumbcaption"><div class="magnify"></div>Tim Berners-Lee in April 2009</div></div></div>
<p>In 1980, physicist Tim Berners-Lee, a contractor at CERN, proposed and prototyped ENQUIRE, a system for CERN researchers to use and share documents. In 1989, Berners-Lee wrote a memo proposing an Internet-based hypertext system.<sup id="cite_ref-3" class="reference">[3]</sup> Berners-Lee specified HTML and wrote the browser and server software in late 1990. That year, Berners-Lee and CERN data systems engineer Robert Cailliau collaborated on a joint request for funding, but the project was not formally adopted by CERN. In his personal notes<sup id="cite_ref-4" class="reference">[4]</sup> from 1990 he listed<sup id="cite_ref-5" class="reference">[5]</sup> "some of the many areas in which hypertext is used" and put an encyclopedia first.
and this would be easy enough to parse with a PHP / Python / Java program and separate into different parts.
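For server-rendered pages like that, a small DOMXPath sketch that pulls the headline spans and paragraph text from markup like the snippet above (the URL is just an example article):

<?php
// Sketch: extract section headings and paragraph text from a
// server-rendered article page such as a Wikipedia article.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents('https://en.wikipedia.org/wiki/HTML'));
$xpath = new DOMXPath($doc);

foreach ($xpath->query("//span[contains(@class,'mw-headline')] | //p") as $node) {
    $text = trim($node->textContent);
    if ($text !== '') {
        // Mark headings so the structure survives in the plain-text output.
        echo ($node->nodeName === 'span' ? '== ' . $text . ' ==' : $text) . "\n\n";
    }
}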
Now let's look at the Google support pages. The source is basically 2,000 lines of JavaScript and that's it.
Parsing this would be possible, but much harder, since you need to actually render the page and execute the JavaScript before the <header>, <div> and <p> tags appear in the DOM.
I believe it would be possible to create an API that scans websites like Wikipedia or Stack Overflow, since they generate the HTML on the server side and only require the client to render it and apply CSS styles to it.
If a website is built on a technology like React.js, you'll see that the entire page is simply JavaScript, and nothing can be processed until that has been executed and rendered.
Would it be possible to render it and parse it afterwards? Probably yes, but an API that can do this for any given website is so much work that you're probably better off training an AI to read webpages and point the parts out for you.

How do you process invalid HTML in PHP?

I've seen this question, which is very nice and informative. However, it doesn't deal with a rather common scenario.
Say I need to scrape a multitude of websites (or even pages in the same domain), but the author of that website didn't care enough for his code, and has some seriously malformed code "that kinda works". I need to take information from that website.
How do I do it in this case? Ideally without going í͞ń̡͢͡s̶̢̛á̢̕͘ń̵͢҉e̶̸̢̛.
Is it possible? Do I have to resort to RegExp?
You need a DOM parser. PHP has one. And then there are some alternatives (and more... just Google for them). You can even run the "garbled HTML" through HTML Purifier if you want.
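A minimal example of what that looks like with PHP's built-in DOM parser; libxml collects the errors from the broken markup and still builds a usable tree:

<?php
// PHP's DOM parser copes with broken markup once libxml's warnings are
// silenced; it repairs the tree as best it can.
$broken = '<div><p>Unclosed paragraph<div>Stray div</span></div>';

libxml_use_internal_errors(true);   // collect errors instead of spewing warnings
$doc = new DOMDocument();
$doc->loadHTML($broken);
libxml_clear_errors();

// The repaired tree can then be queried like any well-formed document.
$xpath = new DOMXPath($doc);
echo $xpath->query('//p')->item(0)->textContent . "\n";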
I don't know how you are scraping the site, but working with RegExp will allow you to add many conditions to the scraping code. This may take time, depending on the number of footprints and your RegExp skills.
You may also use Tidy on the site's HTML, but this will lead to strange results as well, IMO.
Does it have to be PHP? Python has a wonderful library called Beautiful Soup ("You didn't write that awful page. You're just trying to get some data out of it"). From my experience I'd recommend it so highly that I'd say, if you have the option, write a quick Python script to parse your nodes into a clean file that your PHP can pick up.
(I know PHP is in the title and this doesn't directly answer your question. Apologies if you don't have the option of (or dislike) Python; I just wanted to present a good alternative.)

What is the best way to parse Wikipedia markup in PHP?

I'm trying to parse specific Wikipedia content in a structured way. Here's an example page:
http://en.wikipedia.org/wiki/Polar_bear
I'm having some success. I can detect that this page is a "species" page, and I can also parse the Taxobox information (on the right) into a structure. So far so good.
However, I'm also trying to parse the text paragraphs. These are returned by the API in wiki format or HTML format; I'm currently working with the wiki format.
I can read these paragraphs, but I'd like to "clean" them in a specific way, because ultimately I will have to display them in my app, which has no notion of wiki markup. For example, I'd like to remove all images. That's fairly easy by filtering out [[Image:]] blocks. Yet there are also blocks that I simply cannot remove, such as:
{{convert|350|-|680|kg|abbr=on}}
Removing this entire block would break the sentence, and there are dozens of notations like this that have special meaning. I'd like to avoid writing a hundred regular expressions to process all of this, and instead see how I can parse it in a smarter way.
My dilemma is as follows:
I could continue my current path of semi-structured parsing, where I'd have a lot of work deleting unwanted elements as well as "mimicking" templates that do need to be rendered.
Or, I could start with the rendered HTML output and parse that, but my worry is that it's just as fragile and complex to parse in a structured way.
Ideally, there would be a library to solve this problem, but I haven't found one yet that is up to the job. I also had a look at structured Wikipedia databases like DBpedia, but those only have the same structure I already have; they don't provide any structure within the wiki text itself.
There are too many templates in use to reimplement all of them by hand, and they change all the time. So you will need an actual parser for the wiki syntax that can process all the templates.
And the wiki syntax is quite complex, has lots of quirks and no formal specification. This means creating your own parser would be too much work; you should use the one in MediaWiki.
Because of this, I think getting the parsed HTML through the MediaWiki API is your best bet.
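A small sketch of what that API call can look like (action=parse; the page title is the one from the question):

<?php
// Sketch: let MediaWiki render the article, then post-process its HTML.
$api = 'https://en.wikipedia.org/w/api.php?' . http_build_query([
    'action' => 'parse',
    'page'   => 'Polar_bear',   // the example page from the question
    'prop'   => 'text',
    'format' => 'json',
]);

$data = json_decode(file_get_contents($api), true);
$html = $data['parse']['text']['*'];    // rendered article body as HTML

// From here, strip images, infoboxes, etc. with an ordinary DOM parser.
echo strlen($html) . " bytes of rendered HTML\n";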
One thing that's probably easier to parse from the wiki markup is the infoboxes, so maybe they should be a special case.

"Pretty" code and PHP

I understand the concepts of \t and \n in most programming languages that generate dynamic web content. What I use them for, as do most people, is indenting all the fancy HTML to make it readable and "pretty" in view-source. Currently, I am making a website that uses PHP include("whateverfile.php") to construct the layout. You can probably tell where this is going.
How can I tab over a whole block of a PHP include so it aligns with the rest of the page's source?
If this question is worded incorrectly or doesn't make sense, English is my native language so I can't use any excuses.
You can do this by using a layer in your application that takes care of the output, like a theme system. This has the added benefit that your code will be better separated into data processing and output processing.
A good example is given in the following article: When Flat PHP meets Symfony.
Next to that, there is another trick you can use: install an output buffer and then run Tidy on it to make the output look great: Tidying up your HTML with PHP 5.
On top of this, you can always put tabs into your include's output, but you don't always know how many tabs will be needed. There are some other tricks related to output buffering and indenting HTML fragments when they return from includes, but this is very specific and most often not of much use. So the two articles linked above might point you to two areas that will be of much more use to you in the end.
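A minimal sketch of the output-buffer-plus-Tidy trick mentioned above, assuming the tidy extension is enabled; the include file names are placeholders standing in for the question's layout files:

<?php
// Buffer everything the page (and its includes) prints, then let Tidy
// re-indent the finished HTML just before it is sent to the browser.
ob_start(function (string $html): string {
    $config = ['indent' => true, 'indent-spaces' => 4, 'wrap' => 120];
    $fixed  = tidy_repair_string($html, $config, 'utf8');
    return $fixed !== false ? $fixed : $html;   // fall back to raw output on failure
});

include 'header.php';    // placeholder includes
include 'content.php';
include 'footer.php';

ob_end_flush();          // Tidy runs here; the combined output comes out indented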
I think you can do it by adding extra tabs in the included .php pages, but I wouldn't recommend that (obviously). Instead, use tools like Firebug or Chrome's Inspect Element to look at the code.

Find important text in arbitrary HTML using PHP?

I have some random HTML layouts that contain important text I would like to extract. I cannot just strip_tags() as that will leave a bunch of extra junk from the sidebar/footer/header/etc.
I found a method built in Python and I was wondering if there is anything like this in PHP.
The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn't a novel idea, but it works!) The basic process works as follows:

Parse the HTML code and keep track of the number of bytes processed.
Store the text output on a per-line, or per-paragraph basis.
Associate with each text line the number of bytes of HTML required to describe it.
Compute the text density of each line by calculating the ratio of text to bytes.
Then decide if the line is part of the content by using a neural network.

You can get pretty good results just by checking if the line's density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning - not to mention that it's easier to implement!
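A rough PHP sketch of that density idea, using a fixed threshold instead of the neural network; the thresholds and the line-based splitting are crude assumptions:

<?php
// Crude text-density filter: for every line of the raw HTML, compare the
// length of the visible text to the length of the markup and keep dense lines.
$html = file_get_contents('https://example.com/some-article');   // placeholder URL

$kept = [];
foreach (explode("\n", $html) as $line) {
    $htmlBytes = strlen($line);
    $textBytes = strlen(trim(strip_tags($line)));
    if ($htmlBytes === 0 || $textBytes === 0) {
        continue;
    }
    $density = $textBytes / $htmlBytes;        // ratio of text to bytes of HTML
    if ($density > 0.5 && $textBytes > 40) {   // arbitrary thresholds, tune per site
        $kept[] = trim(strip_tags($line));
    }
}
echo implode("\n\n", $kept);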
Update: I started a bounty for an answer that could pull main content from a random HTML template. Since I can't share the documents I will be using - just pick any random blog sites and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may contain text also. See the link above for ideas.
phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.
UPDATE 2
DEMO: http://so.lucafilosofi.com/find-important-text-in-arbitrary-html-using-php/
tested on a casual list of blogs taken from the Technorati Top 100 and Best Blogs of 2010
many blogs make use of a CMS;
blogs' HTML structure is almost always the same;
avoid common selectors like #sidebar, #header, #footer, #comments, etc.;
avoid any widget by tag name (script, iframe);
clear well-known content like:
/\d+\scomment(?:[s])/im
/(read the rest|read more).*/im
/(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
/[^a-z0-9]+/im
search for well-known classes and IDs:
typepad.com .entry-content
wordpress.org .post-entry .entry .post
movabletype.com .post
blogger.com .post-body .entry-content
drupal.com .content
tumblr.com .post
squarespace.com .journal-entry-text
expressionengine.com .entry
gawker.com .post-body
Ref: The blog platforms of choice among the top 100 blogs
$selectors = array('.post-body','.post','.journal-entry-text','.entry-content','.content');
// find() expects a CSS selector string, so join the candidate selectors with commas
$doc = phpQuery::newDocumentFile('http://blog.com')->find(implode(',', $selectors))->children('p,div');
search based on a common HTML structure that looks like this:
<div>
<h1|h2|h3|h4|a />
<p|div />
</div>
$doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div');
DOMDocument can be used to parse HTML documents, which can then be queried through PHP.
I worked on a similar project a while back. It's not as complex as the Python script, but it will do a good job. Check out the PHP Simple HTML DOM Parser:
http://simplehtmldom.sourceforge.net/
Depending on your HTML structure, and if you have IDs or classes in place, you can get a little more involved and use preg_match() to grab the information between a certain start and end tag. This means that you should know how to write regular expressions.
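For instance, a hedged sketch of that preg_match() approach; the wrapper class name is made up, and this breaks as soon as the wrapper contains a nested div:

<?php
// Deliberately simple: grab whatever sits between a known start tag and the
// next closing </div>. The class name is invented; nested divs will break this.
$html = file_get_contents('https://example.com');   // placeholder URL

if (preg_match('#<div class="article-body">(.*?)</div>#s', $html, $match)) {
    echo trim(strip_tags($match[1])) . "\n";
} else {
    echo "wrapper not found\n";
}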
You can also look into a browser-emulation PHP class. I've done this for page scraping and it works well enough, depending on how well-formed the DOM is. I personally like SimpleBrowser:
http://www.simpletest.org/api/SimpleTest/WebTester/SimpleBrowser.html
I have developed an HTML parser and filter PHP package that can be used for that purpose.
It consists of a set of classes that can be chained together to perform a series of parsing, filtering and transformation operations in HTML/XML code.
It was meant to deal with real-world pages, so it can handle malformed tags and data structures and preserve as much of the original document as possible.
One of the filter classes it comes with can do DTD validation. Another can discard insecure HTML tags and CSS to prevent XSS attacks. Another can simply extract all document links.
All those filter classes are optional. You can chain them together the way you want, if you need any at all.
So, to solve your problem, I don't think there is already a specific solution for this in PHP anywhere, but a special filter class could be developed for it. Take a look at the package; it is thoroughly documented.
If you need help, just check my profile and mail me, and I may even develop a filter that does exactly what you need, possibly inspired by solutions that exist for other languages.
