The question title says it all: after a bit of Googling and several days of tinkering with code, I cannot figure out how to download the plain text of a webpage.
Using strip_tags() still leaves the JavaScript and CSS, and trying to clean it up with a regex also causes issues.
Is there any (simple or complicated) way to download a webpage (say a Wikipedia article) in plain text using PHP?
I downloaded the page using PHP's file_get_contents(), like this:
$homepage = file_get_contents('http://www.example.com/');
As I said, I tried using strip_tags() etc., but I can't get just the plain text.
I've also tried http://millkencode.googlecode.com/svn/trunk/htmlxtractor/ContentExtractor.php to get the main content, but it doesn't seem to work.
This is not nearly as easy as it seems. I'd recommend looking at something like PHP Simple HTML DOM Parser. Aside from JavaScript and CSS being hard to remove (and using regex on HTML is not proper), there could still be inline styling and the like left over.
This, of course, is relative to the complexity of the HTML. strip_tags could be sufficient in some cases.
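For the simpler cases, one pragmatic approach (just a sketch, not bulletproof) is to drop the <script>/<style> blocks first and then strip_tags() the rest:

// Rough sketch: remove <script>/<style> blocks, then strip the remaining tags.
$homepage = file_get_contents('http://www.example.com/');
$homepage = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $homepage);
$plain = trim(strip_tags($homepage));
echo $plain;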
Use this code:
require_once('simple_html_dom.php');

// Load the article and pull the plain text of the title and body by element ID
$content = file_get_html('http://en.wikipedia.org/wiki/FYI');
$title = $content->find('#firstHeading', 0)->plaintext;
$text = $content->find('#bodyContent', 0)->plaintext;
echo $title . $text;
http://simplehtmldom.sourceforge.net
Related
I am using file_get_html('URL') of PHP Simple HTML DOM Parser to get the source code of any website. However, I have problems getting the source of one specific website:
https://www.nkbm.si/tecajne-liste-menjalnica?currencyexchangetypeid=1
If I echo $html I get some strange characters. It looks like the website is protected from scraping. Is this possible? Is there any way around it?
Screenshot of the parsed HTML: [image omitted]
Thanks.
I found the answer: the source was gzipped, so I had to decompress it first.
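For anyone hitting the same thing, a minimal sketch of the decompression step (assuming PHP 5.4+ for gzdecode(), or cURL):

// Option 1: request gzip explicitly and decompress the body yourself
$context = stream_context_create(array('http' => array('header' => "Accept-Encoding: gzip\r\n")));
$raw = file_get_contents('https://www.nkbm.si/tecajne-liste-menjalnica?currencyexchangetypeid=1', false, $context);
$html = gzdecode($raw);   // assumes the server actually sent gzip

// Option 2: let cURL negotiate and decode the encoding automatically
$ch = curl_init('https://www.nkbm.si/tecajne-liste-menjalnica?currencyexchangetypeid=1');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, '');   // empty string = accept all supported encodings
$html = curl_exec($ch);
curl_close($ch);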
I am looking at getting the plain text from HTML. Which one should I choose: PHP's strip_tags or simplehtmldom's plaintext extraction?
One pro for simplehtmldom is its support for invalid HTML; is that sufficient in itself?
strip_tags is sufficient for that.
Extracting text from HTML is tricky, so the best option is to use a library like Html2Text. It was built specifically for this purpose.
https://github.com/mtibben/html2text
Install using composer:
composer require html2text/html2text
Basic usage:
$html = new \Html2Text\Html2Text('Hello, "<b>world</b>"');
echo $html->getText(); // Hello, "WORLD"
You should probably use simplehtmldom for the reason you mentioned, and because strip_tags may also leave you with non-text content such as JavaScript or CSS contained within script/style blocks.
You would also be able to filter out text from elements that aren't displayed (e.g. inline style="display:none").
That said, if the HTML is simple enough, then strip_tags may be faster and will accomplish the same task.
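For example, with simplehtmldom you could blank out the script/style nodes before reading the plain text. A rough sketch (the reload step is there so the removals show up in ->plaintext):

require_once('simple_html_dom.php');

$html = file_get_html('http://www.example.com/');

// blank out script and style elements
foreach ($html->find('script') as $node) { $node->outertext = ''; }
foreach ($html->find('style') as $node) { $node->outertext = ''; }

// re-parse the modified markup, then read the text
$html->load($html->save());
echo $html->plaintext;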
If you just want a plain text rendering of a page then strip_tags is faster and simpler. If you want to do any manipulation of the text during that process, however, simplehtmldom is going to serve you better in the long run.
You may also want to remove slashes with stripslashes().
I can't seem to get this to work and I was hoping for some help.
I'm trying to capture the contents of a specific div (please save the DOM talk, for this specific purpose it doesn't really come into play.)
The problem is, I can't seem to get it to work if there is another div with attributes before it on the same line. I tried specifying that it should only match if there's no > between <div and class="myClass", but I think I'm doing it wrong.
I'm still pretty mystified by regex.
/<div(?!>).*?class="myClass".*?>(.*?)<\/div>/mi
(semi) Working example: http://regex101.com/r/cW0lW6
Try
/<div(?=\s)(?:(?!>).)+?class="myClass".*?>(.*?)<\/div>/si
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML.
See: RegEx match open tags except XHTML self-contained tags
I suggest using QueryPath for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.
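A rough sketch of what that looks like with QueryPath (installed via Composer; the method names here are from memory, so double-check against the QueryPath docs):

require 'vendor/autoload.php';

// htmlqp() is QueryPath's lenient HTML loader; selectors work like jQuery
$page = file_get_contents('http://www.example.com/page.html');
echo htmlqp($page, 'div.myClass')->innerHTML();   // contents of the first matching div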
You can use this (simple way):
~<div[^>]+?class="myClass"[^>]*>(.*?)</div>~si
or this (more efficient way if you have a lot of attributes):
~<div(?>[^>c]++|\Bc|c(?!lass=))+class="myClass"[^>]*+>(.*?)</div>~si
Note that these patterns don't work if your div tag contains another div tag.
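For reference, applying the first pattern is just a matter of (sketch):

// collect the inner HTML of every matching div into $matches[1]
preg_match_all('~<div[^>]+?class="myClass"[^>]*>(.*?)</div>~si', $html, $matches);
print_r($matches[1]);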
I used the following code to remove script and link tags from my string:
$contents='<script>inside tag</script>hfgkdhgjh<script>inside 2</script>';
$ss=preg_replace('#<script(.*?)>(.*?)</script>#is', '', $contents);
echo htmlspecialchars($ss);
It works fine. But can I use something more like HTML parsing rather than preg_match for this?
Here are a few things you can do:
htmlspecialchars() renders those tags harmless (they are encoded and displayed as text instead of being executed)
strip_tags() removes all HTML tags
But the technique you are using is the correct one. Here is an improved version of it:
echo preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $contents);
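If you'd rather use an actual parser instead of a regex, here's a rough sketch with PHP's built-in DOMDocument (assuming the string is a fragment, so libxml warnings are suppressed):

$contents = '<script>inside tag</script>hfgkdhgjh<script>inside 2</script>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // the fragment has no <html>/<body>, ignore the warnings
$dom->loadHTML($contents);
libxml_clear_errors();

// remove every <script> element (the node list is live, so pop from the front)
while (($script = $dom->getElementsByTagName('script')->item(0)) !== null) {
    $script->parentNode->removeChild($script);
}

echo $dom->textContent;   // "hfgkdhgjh"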
HTML Purifier is always a good choice. phpQuery has also come in handy a few times.
If you are sanitizing content, it's very easy to make mistakes with regular expressions... read this post. It just depends what you're trying to achieve.
I'm trying to parse strings that represent source code, something like this:
[code lang="html"]
<div>stuff</div>
[/code]
<div>stuff</div>
As you can see from my previous 20 questions, I tried to do it with PHP's regex functions, but ran into many problems, especially when the string is very big...
Do you guys know a BB parser class written in PHP that I can use for this, instead of regexes?
What I need it to do is:
be able to convert all content within [code] tags to HTML entities
be able to run some kind of a filter (a callback function of mine) only on content outside of the [code] tags
thank you
edit:
I ended up using this:
convert all <pre> and <code> tags to [pre] and [code]:
str_replace(array('<pre>', '</pre>', '<code>', '</code>'), array('[pre]', '[/pre]', '[code]', '[/code]'), $content);
get the contents from between [code]...[/code] and [pre]...[/pre] and do the HTML entity conversion:
preg_replace_callback('/(.?)\[(pre|code)\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)/s', 'self::specialchars', $content);
(I stole this pattern from WordPress's shortcode functions :)
store the entity-converted content in a temporary array, and replace it in $content with a unique ID (sketched below)
I can now safely run my filter on $content, because there's no code in it, just the ID (this filter does a strip_tags on the entire text and converts stuff like http://blabla.com to links)
replace the unique IDs from $content with the converted code blocks from the array variable
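Roughly, the placeholder part looks like this (simplified sketch; my_filter() stands in for the actual callback):

$codeBlocks = array();

// swap each [code]...[/code] block for a unique ID and entity-encode its contents
$content = preg_replace_callback('/\[code\](.*?)\[\/code\]/s', function ($m) use (&$codeBlocks) {
    $id = '@@CODE' . count($codeBlocks) . '@@';
    $codeBlocks[$id] = '[code]' . htmlspecialchars($m[1]) . '[/code]';
    return $id;
}, $content);

// the filter now only ever sees text, never code
$content = my_filter($content);   // hypothetical callback: strip_tags + auto-linking

// put the converted code blocks back
$content = strtr($content, $codeBlocks);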
Do you think it's OK?
HTML Purifier http://htmlpurifier.org/
But you are facing the same issues as in your 20 previous questions.
Do you guys know a BB parser class written in PHP that I can use for this, instead of regexes?
There's the BBCode PECL extension, but you'd need to compile it.
There's also PEAR's HTML_BBCodeParser, though I can't vouch for how effective it is.
There are also a few elsewhere, but I think they're all pretty rigid.
I don't believe that either of those does what you're looking for with regard to having a callback for tag contents (and #webarto is totally correct that HTML Purifier is the right tool for processing the contents). You might have to write your own here. I've previously written about my experiences doing the same, which you might find helpful.