I am looking at getting the plain text from HTML. Which one should I choose: PHP's strip_tags or simplehtmldom's plaintext extraction?
One advantage of simplehtmldom is its support for invalid HTML. Is that sufficient in itself?
strip_tags is sufficient for that.
Extracting text from HTML is tricky, so the best option is to use a library like Html2Text. It was built specifically for this purpose.
https://github.com/mtibben/html2text
Install using composer:
composer require html2text/html2text
Basic usage:
$html = new \Html2Text\Html2Text('Hello, "<b>world</b>"');
echo $html->getText(); // Hello, "WORLD"
You should probably use simplehtmldom, both for the reason you mentioned and because strip_tags may leave you with non-text content such as JavaScript or CSS contained within script/style blocks.
You would also be able to filter out text from elements that aren't displayed (inline style=display:none).
That said, if the HTML is simple enough, strip_tags may be faster and will accomplish the same task.
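To illustrate the script/style point, here is a minimal sketch that uses PHP's built-in DOM extension rather than simplehtmldom. The function name htmlToPlainText is made up for this example, and it assumes non-empty input that libxml can recover from:

```php
<?php
// Hypothetical helper (not from the thread): plain text without <script>/<style> bodies.
function htmlToPlainText(string $html): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings about invalid markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Snapshot the matches into an array before modifying the tree.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $root = $doc->documentElement;
    $text = $root ? $root->textContent : '';

    // Collapse the runs of whitespace left behind by the removed markup.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo htmlToPlainText('<p>Hello <b>world</b></p><script>alert(1);</script>'), "\n";
```

Unlike strip_tags, this drops the contents of the script block as well as the tags. It does not handle elements hidden with inline styles.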
If you just want a plain text rendering of a page then strip_tags is faster and simpler. If you want to do any manipulation of the text during that process, however, simplehtmldom is going to serve you better in the long run.
You may also want to remove slashes with stripslashes().
The question title says it all, after a bit of Googling and several days of tinkering with code, I cannot figure out how to download the plain text of a webpage.
Using strip_tags(); still leaves the JavaScript and CSS and trying to clean it up with regex also causes issues.
Is there any (simple or complicated) way to download a webpage (say a Wikipedia article) in plain-text using PHP?
I downloaded the page using PHP's file_get_contents(); as here:
$homepage = file_get_contents('http://www.example.com/');
As I said, I tried using strip_tags(); etc but I can't get the plain text.
I've tried using: http://millkencode.googlecode.com/svn/trunk/htmlxtractor/ContentExtractor.php to get the main content but it doesn't seem to work.
This is not nearly as easy as it seems. I'd recommend looking at something like PHP Simple HTML DOM Parser. Aside from JavaScript and CSS being hard to remove (and regex not being the proper tool for HTML), there could still be inline styling and similar leftovers.
This, of course, is relative to the complexity of the HTML. strip_tags could be sufficient in some cases.
Use this code:
require_once('simple_html_dom.php');
$content=file_get_html('http://en.wikipedia.org/wiki/FYI');
$title=$content->find("#firstHeading",0)->plaintext ;
$text=$content->find("#bodyContent",0)->plaintext;
echo $title.$text;
http://simplehtmldom.sourceforge.net
I'm trying to parse strings that represent source code, something like this:
[code lang="html"]
<div>stuff</div>
[/code]
<div>stuff</div>
As you can see from my previous 20 questions, I tried to do it with PHP's regex functions, but ran into many problems, especially when the string is very big...
Do you guys know a BB parser class written in PHP that I can use for this, instead of regexes?
What I need it to do is:
be able to convert all content from within [code] tags with html entities
be able to run some kind of a filter (a callback function of mine) only on content outside of the [code] tags
thank you
edit:
I ended up using this:
convert all <pre> and <code> tags to [pre] and [code]:
str_replace(array('<pre>', '</pre>', '<code>', '</code>'), array('[pre]', '[/pre]', '[code]', '[/code]'), $content);
get contents from between [code]..[/code] and [pre]...[/pre] and do the html entity conversion
preg_replace_callback('/(.?)\[(pre|code)\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)/s', 'self::specialchars', $content);
(I stole this pattern from the WordPress shortcode functions :)
store the entity converted content in a temporary array variable, and replace the one from $content with a unique ID
I can now safely run my filter on $content, because there's no code in it, just the ID (this filter does a strip_tags on the entire text and converts stuff like http://blabla.com to links)
replace the unique IDs from $content with the converted code blocks from the array variable
do you think it's ok?
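The placeholder steps above could be sketched roughly as follows. This is not the poster's code: the function name filterOutsideCode, the %%CODEBLOCKn%% token format, and the choice to wrap restored blocks in <code>/<pre> tags are all assumptions made for illustration, and it uses a simpler pattern than the WordPress one.

```php
<?php
// Hypothetical helper illustrating the placeholder steps; names are not from the thread.
function filterOutsideCode(string $content, callable $filter): string
{
    $blocks = [];

    // 1. Swap each [code]...[/code] or [pre]...[/pre] block for a unique token
    //    and keep the entity-encoded block in $blocks.
    $content = preg_replace_callback(
        '/\[(code|pre)\b[^\]]*\](.*?)\[\/\1\]/s',
        function (array $m) use (&$blocks): string {
            $token = '%%CODEBLOCK' . count($blocks) . '%%';
            $blocks[$token] = '<' . $m[1] . '>' . htmlspecialchars($m[2]) . '</' . $m[1] . '>';
            return $token;
        },
        $content
    );

    // 2. Run the caller's filter on text that now contains tokens instead of code.
    $content = $filter($content);

    // 3. Put the encoded blocks back (skip strtr when nothing was extracted).
    return $blocks ? strtr($content, $blocks) : $content;
}

echo filterOutsideCode(
    'See <b>this</b>: [code lang="html"]<div>stuff</div>[/code]',
    'strip_tags'
), "\n";
// prints: See this: <code>&lt;div&gt;stuff&lt;/div&gt;</code>
```

The sketch assumes the token never occurs in the user's own text and that the filter leaves the tokens untouched.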
HTML Purifier http://htmlpurifier.org/
But you are facing the same issues as in your 20 previous questions.
Do you guys know a BB parser class written in PHP that I can use for this, instead of regexes?
There's the BBCode PECL extension, but you'd need to compile it.
There's also PEAR's HTML_BBCodeParser, though I can't vouch for how effective it is.
There are also a few elsewhere, but I think they're all pretty rigid.
I don't believe that either of those does what you're looking for with regard to having a callback for tag contents (and @webarto is totally correct that HTMLPurifier is the right tool to use when processing the contents). You might have to write your own here. I've previously written about my experiences doing the same, which you might find helpful.
I want to get the <form> from the site, but in this case there is still a lot of other HTML code inside the form. How do I remove it? I mean, how can I use PHP to match just the <form> and </form> part of the site?
$str = file_get_contents('http://bingphp.codeplex.com');
preg_match_all('~<form.+</form>~iUs', $str, $match);
var_dump($match);
You should not use regular expressions for extracting HTML content. Use a DOM parser.
E.g.
$doc = new DOMDocument();
$doc->loadHTMLFile("http://bingphp.codeplex.com");
$forms = $doc->getElementsByTagName('form');
Update: If you want to remove the forms (not sure if you meant that):
// Iterate backwards: the node list is live and shrinks as forms are removed.
for ($i = $forms->length; $i--;) {
    $node = $forms->item($i);
    $node->parentNode->removeChild($node);
}
Update 2:
I just noticed that they have one form that wraps the whole body content. So this way or another, you will get the whole page actually.
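For completeness, a small sketch of how the markup of each form found this way could be printed. The helper name extractForms and the inline sample markup are assumptions, used so the example runs without fetching the site:

```php
<?php
// Hypothetical helper: returns the serialized HTML of every <form> in the input.
function extractForms(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings about invalid markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $forms = [];
    foreach ($doc->getElementsByTagName('form') as $form) {
        // saveHTML($node) serializes just that node and its children.
        $forms[] = $doc->saveHTML($form);
    }
    return $forms;
}

// Sample markup made up for the example.
$sample = '<html><body><p>intro</p><form action="/search"><input name="q"></form></body></html>';
foreach (extractForms($sample) as $formHtml) {
    echo $formHtml, "\n";
}
```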
The regex problem lies in the greediness. For such cases .+? is advisable.
But what @Felix said. While a regular expression is workable for HTML extraction, you are often looking for something specific and should therefore parse the document instead. It's also much simpler if you use QueryPath:
$str = file_get_contents('http://bingphp.codeplex.com');
print qp($str)->find("form")->html();
The best way I can think of is to use the Simple HTML DOM library with PHP to get the form(s) from the HTML page using DOM queries.
It is a little more convenient than using built-in xml parsers like simplexml or domdocument.
You can find the library here.
Normally you should use DOM to parse HTML, but in this case the web site is very far from being standard HTML, with some of the code being modified in place by javascript. It can therefore not be loaded into the DOM object. This might be intentional, a way of obfuscating the code.
In any case, it is not so much your RE (although using a non-greedy match would help), but the design of the site itself which is preventing you from parsing out what you want.
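As a side note on the greedy versus non-greedy point, here is a tiny sketch with made-up sample markup. It deliberately leaves out the U modifier, since in PCRE that modifier flips the default greediness of quantifiers:

```php
<?php
// Sample markup made up for the example: two forms with a paragraph in between.
$html = '<form id="a">one</form><p>between</p><form id="b">two</form>';

// Greedy: .+ runs to the LAST </form>, swallowing everything in between.
preg_match('~<form.+</form>~is', $html, $greedy);

// Non-greedy: .+? stops at the FIRST </form>.
preg_match('~<form.+?</form>~is', $html, $lazy);

echo $greedy[0], "\n"; // prints the whole sample string
echo $lazy[0], "\n";   // prints <form id="a">one</form>
```

Neither variant understands nesting or markup rewritten by JavaScript, which is why the DOM approach is usually preferred.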
What is the best way to highlight a searched phrase within an HTML document?
I have the complete HTML document as a large string in a variable.
I want to highlight the searched term, excluding any text inside tags.
For example, if the user searches for "img", the img tag should be ignored but
the phrase "img" within the text should be highlighted.
Don't use regex.
Because regex cannot parse HTML (or even come close), any attempt to mess around with matching words in an HTML string risks breaking words that appear in markup. A badly-implemented HTML regex hack can even leave you with HTML-injection vulnerabilities which an attacker may be able to leverage to do cross-site-scripting.
Instead you should parse the HTML and do the searches on the text content only.
If you can accept a solution that adds the highlighting from JavaScript on the client side, this is really easy because the browser will already have parsed the HTML into a bunch of DOM objects you can manipulate. See eg. this question for a client-side example.
If you have to do it with PHP that's a bit more tricky. The simple solution would be to use DOMDocument::loadHTML and then translate the findText function from the above example into PHP. At least the DOM methods used are standardised so they work the same.
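A rough sketch of that PHP route is below. It is not a translation of the linked findText function (which is not shown here); the helper name highlightTerm and the choice of an <em> wrapper are assumptions, and it only touches text nodes, so tag names and attribute values are left alone:

```php
<?php
// Hypothetical helper: wraps matches of $term in <em>, touching text nodes only.
function highlightTerm(string $html, string $term): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // preg_quote keeps regex metacharacters in the search term literal.
    $pattern = '/(' . preg_quote($term, '/') . ')/i';

    // Text nodes only, skipping script and style bodies.
    $textNodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');
    foreach (iterator_to_array($textNodes) as $textNode) {
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                // Odd indexes hold the captured matches.
                $em = $doc->createElement('em');
                $em->appendChild($doc->createTextNode($part));
                $fragment->appendChild($em);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Returns the full serialized document, including the wrappers libxml adds.
    return $doc->saveHTML();
}

echo highlightTerm('<p>An img tag: <img src="a.png" alt="img"></p>', 'img');
```

Because the term is inserted with createTextNode, it is escaped on output, which avoids the HTML-injection problem mentioned above.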
Edit: This was tagged as Java before, so this answer might not be applicable.
This is quick and dirty but it might work for you, or at least be a starting point
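private String highlight(String search, String html) {
    // Pattern.quote keeps regex metacharacters in the search term literal.
    String term = java.util.regex.Pattern.quote(search);
    return html.replaceAll("(>[^<]*)(" + term + ")([^>]*<)", "$1<em>$2</em>$3");
}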
This requires testing, and I make no guarantees that it's correct, but the simplest way to explain it is that you ensure your term exists between two tags and is thus not itself a tag or part of a tag parameter.
var highlight = function(what){
    var html = document.body.innerHTML,
        word = "(" + what + ")",
        match = new RegExp(word, "gi");
    html = html.replace(match, "<span style='background-color: red'>$1</span>");
    document.body.innerHTML = html;
};
highlight('ll');
This would highlight any occurrence of 'll'.
Be careful when calling highlight() with < or > or any tag name: it would replace those too, screwing up your markup. You might work around that by reading innerText instead of innerHTML, but that way you'll lose the markup information.
The best way is probably to implement a parser routine yourself.
Example: http://www.jsfiddle.net/DRtVn/
There is a free JavaScript library that might help you out -> http://scott.yang.id.au/code/se-hilite/
You must be using some server side language to render the search results on the webpage.
So the best way I can think of is to highlight the word while rendering it, using the server-side language itself, which may be PHP, Java or any other language.
This way you would have only the result strings without html and without parsing overhead.
I googled a lot, for those kind of problems have been asked a lot in the past. But I didn't find anything to match my needs.
I have a html formatted text from a form. Just like this:
Hey, I am just some kind of <strong>formatted</strong> text!
Now, I want to strip all html tags, that I don't allow. PHP's built-in strip_tags() Method does that very well.
But I want to go a step further: I want to allow some Tags only inside or not inside of other tags. I also want to define my own XML Tags.
Another example:
I am a custom xml tag: <book><strong>Hello!</strong></book>. Ok... <strong>Hi!</strong>
Now, I want the <strong/> inside of <book/> to be stripped, but the <strong>Hi!</strong> can stay the way it is.
So, I want to define some rules of what I allow or don't allow, and want to have any filter do the rest.
Is there any easy way to do that? Regexp aren't what I'm looking for, for they can't parse html properly.
Regards, Jan Oliver
Don't think there is such a thing, I think not even HTML Purifier does that.
I suggest you parse the XHTML by hand using something like Simple HTML Dom.
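As a rough idea of what such hand parsing might look like, here is a sketch using PHP's built-in DOM classes instead of Simple HTML Dom. The helper name unwrapInside is made up, and it relies on libxml keeping unknown tags such as <book> in the tree, which should be checked against your own input:

```php
<?php
// Hypothetical helper: removes <$tag> wrappers, but only inside <$ancestor>.
function unwrapInside(string $html, string $tag, string $ancestor): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a div so there is a single root to serialize later.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query("//{$ancestor}//{$tag}")) as $node) {
        // Move the children up in front of the element, then drop the empty element.
        while ($node->firstChild) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        $node->parentNode->removeChild($node);
    }

    // Serialize the children of the wrapper div, not the wrapper itself.
    $root = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo unwrapInside('<book><strong>Hello!</strong></book> <strong>Hi!</strong>', 'strong', 'book'), "\n";
```

A fuller rule set (which tags are allowed where) could be expressed as a list of ancestor/tag pairs passed through the same function.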
Use a second argument to strip_tags, which is allowable tags.
$text = strip_tags($text, '<book><myxml:tag>');
I don't think there's a way to only strip certain tags if they're not inside other tags, without using regex.
Also, regexes aren't bad at parsing HTML, but they're slow compared to the other options. That's not what you're doing here anyway, though: you're going through the string and removing things you don't want. And for your complex requirement, I think your only option is to use regex.
To be completely honest I think you should decide which tags are allowable and which aren't. Whether or not they are inside of other tags shouldn't matter at all. It's markup, not a script.
The second argument shows that you can allow some tags:
string strip_tags ( string $str [, string $allowable_tags ] )
From php.net
I wrote my own Filter class based on the DOM classes of PHP. Look here: XHTMLFilter class