Is there a jsoup-like HTML parser for PHP? - php

I need an HTML parser for PHP that can use CSS selectors to select elements; in Java we have jsoup. Is there such a library for PHP?

Try phpQuery; it uses CSS-style selection similar to jQuery, which by the sound of your description is similar to jsoup.

I use this one: http://simplehtmldom.sourceforge.net/

Related

Strip <style> blocks from HTML using regex [duplicate]

Possible Duplicate:
How to parse and process HTML with PHP?
I want to be able to strip inline CSS {} blocks from HTML using preg_replace. Does anyone know the regex for that?
UPDATE
I won't be controlling the pages. I want to strip all markup from a page and just leave the content.
There is a great 3rd-party library that makes simple DOM manipulations like these really easy.
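// Assumes simple_html_dom.php (PHP Simple HTML DOM Parser) is on the include path.
require_once 'simple_html_dom.php';
$html = new simple_html_dom();
$html->load($inputString);
// Blank out each <style> element, tags included.
foreach ($html->find('style') as $style) {
    $style->outertext = '';
}
$outputString = $html->save();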
If you cannot use 3rd-party libraries for some reason, using PHP's built-in DOM module is still a better option than regex.
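A minimal sketch of the built-in DOM approach, for comparison. The sample markup and variable names are made up for illustration; only DOMDocument and the libxml functions come from PHP itself.

```php
<?php
// Hypothetical input: a page with one <style> block.
$inputString = '<html><head><style>p { color: red; }</style></head>'
             . '<body><p>Hello</p></body></html>';

// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($inputString);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// getElementsByTagName() returns a live list, so copy it before removing nodes.
$styles = iterator_to_array($doc->getElementsByTagName('style'));
foreach ($styles as $style) {
    $style->parentNode->removeChild($style);
}

$outputString = $doc->saveHTML();
echo $outputString;
```

Copying the node list first matters: removing nodes while iterating the live list can skip elements.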
If you want to keep the tags but remove only their contents, use innertext instead of outertext.
For stripping inline CSS, this method seems rather odd to me. Why not approach this with JavaScript or even jQuery?
Just invoke removeAttr with jQuery.
removeAttr | jQuery API
First, regexes are not the way to parse HTML. If you actually want to parse HTML, and can't use an existing solution, then use the DOM module in PHP. http://php.net/manual/en/book.dom.php
Fortunately, PHP already has a function that will strip tags from a block of HTML. It is called strip_tags(). http://php.net/manual/en/function.strip-tags.php
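A short sketch of strip_tags() on made-up sample strings. One caveat worth knowing for the question above: strip_tags() removes the tags only, so the CSS text inside a style element is left behind as ordinary text.

```php
<?php
// Hypothetical sample markup.
$html = '<div><h1>Title</h1><p>Some <b>bold</b> text.</p></div>';
$text = strip_tags($html);
echo $text, "\n";

// The <style> tags are removed, but the rules between them remain as text.
$withStyle = '<style>p { color: red; }</style><p>Hello</p>';
$stripped = strip_tags($withStyle);
echo $stripped, "\n";
```

If the style rules must go too, removing the style elements with a DOM parser before calling strip_tags() is one option.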

DOMDocument wrappers?

Are there any HTML parsers written in PHP that use DOMDocument for parsing?
I'm basically looking for a wrapper class that provides nicer and more natural API than DOMDocument, which is problematic to work with.
There is SmartDOMDocument; it fixes a few things, such as encoding and outputting as a string.
I don't know of any other wrappers, but you can use an alternative to DOMDocument:
PHPQuery
PHP Simple HTML DOM Parser
Ganon
Also, do you realize DOMXPath exists?
It makes it way easier to retrieve values.
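As a rough illustration of DOMXPath, the sketch below pulls link targets and labels out of a small made-up list. The XPath expression plays the role a CSS selector such as li.item > a would play in jsoup or jQuery.

```php
<?php
// Hypothetical sample markup.
$html = '<ul><li class="item"><a href="/a">First</a></li>'
      . '<li class="item"><a href="/b">Second</a></li></ul>';

$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select every <a> that is a direct child of <li class="item">.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//li[@class="item"]/a') as $a) {
    $links[$a->getAttribute('href')] = $a->textContent;
}
print_r($links);
```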
http://www.phpbuilder.com/columns/PHP_HTML_DOM_parser/PHPHTMLDOMParser.cc_09-07-2011.php3 is another possibility.

Html Parser for PHP like Java

I have been developing Java programs that parse html source code of webpages by using various html parsers like Jericho, NekoHtml etc...
Now I want to develop parsers in PHP. Before starting, I want to know whether there are any HTML parsers available that I can use with PHP to parse HTML code.
Check out DOMDocument.
Example #1 Creating a Document
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
echo $doc->saveHTML();
The built-in DOM parser class does a very good job. There are many other XML parsers, too.
DOM is pretty good for this. It can also deal with invalid markup; however, it will throw undocumented errors and exceptions in cases of imperfect markup, so I suggest you filter the HTML with HTMLPurifier or some other library before loading it with the DOM.
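In practice, the problems libxml finds in imperfect markup usually surface as PHP warnings. A sketch of one way to collect them instead of letting them print, assuming nothing beyond the standard libxml functions (the sample markup is made up, and how many problems get reported depends on the libxml version):

```php
<?php
// Hypothetical malformed input: stray </div> and </i> end tags.
$html = '<html><body></div>error.<p>another error</i></body></html>';

// Route parser problems into an internal buffer rather than PHP warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Report whatever the parser complained about, if anything.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
$output = $doc->saveHTML();
echo $output;
```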

Parsing of badly formatted HTML in PHP

In my code I convert some styled xls document to html using openoffice.
I then parse the tables using xml_parser_create.
The problem is that OpenOffice creates old-school HTML with unclosed <BR> and <HR> tags; it doesn't create doctypes and doesn't quote attributes (<TABLE WIDTH=4>).
The PHP parsers I know of don't like this and yield XML formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast.
Do you know a (hopefully included) php-parser, that doesn't care about these kinds of mistakes? Or perhaps a fast way to fix a 'broken' html?
A solution to "fix" broken HTML could be to use HTMLPurifier (quoting) :
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant
An alternative idea might be to try loading your HTML with DOMDocument::loadHTML (quoting) :
The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.
And if you're trying to load HTML from a file, see DOMDocument::loadHTMLFile.
There is SimpleHTML
For repairing broken HTML, you could use Tidy.
As an alternative, you can use the native XMLReader. Because it acts as a cursor going forward on the document stream and stopping at each node on the way, it will not break on invalid XML documents.
See http://www.ibm.com/developerworks/library/x-pullparsingphp.html
Any particular reason you're still using the PHP 4 XML API?
If you can get away with using PHP 5's XML API, there are two possibilities.
First, try the built-in HTML parser. It's really not very good (it tends to choke on poorly formatted HTML), but it might do the trick. Have a look at DomDocument::LoadHTML.
Second option - you could try the HTML parser based on the HTML5 parser specification:
http://code.google.com/p/html5lib/
This tends to work better than the built-in PHP HTML parser. It loads the HTML into a DomDocument object.
A solution is to use DOMDocument.
Example :
$str = "
<html>
<head>
<title>test</title>
</head>
<body>
</div>error.
<p>another error</i>
</body>
</html>
";
$doc = new DOMDocument();
// loadHTML() tolerates the stray </div> and </i>; it may report them as warnings.
$doc->loadHTML($str);
echo $doc->saveHTML();
Advantage: it is natively included in PHP, unlike PHP Tidy.

DOM manipulation in PHP

I am looking for good methods of manipulating HTML in PHP. For example, the problem I currently have is dealing with malformed HTML.
I am getting input that looks something like this:
<div>This is some <b>text
As you noticed, the HTML is missing closing tags. I could use regex or an XML Parser to solve this problem. However, it is likely that I will have to do other DOM manipulation in the future. I wonder if there are any good PHP libraries that handle DOM manipulation similar to how Javascript deals with DOM manipulation.
PHP has a PECL extension that gives you access to the features of HTML Tidy. Tidy is a pretty powerful library that should be able to take code like that and close tags in an intelligent manner.
I use it to clean up malformed XML and HTML sent to me by a classified ad system prior to import.
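A minimal sketch of what that can look like, assuming the tidy extension is installed and enabled (the guard below simply reports when it is not). The show-body-only option asks Tidy to return just the repaired fragment rather than a full document.

```php
<?php
// The malformed fragment from the question.
$broken = '<div>This is some <b>text';

$fixed = null;
if (function_exists('tidy_repair_string')) {
    // Repair the fragment; Tidy closes the open <b> and <div>.
    $fixed = tidy_repair_string($broken, ['show-body-only' => true], 'utf8');
    echo $fixed;
} else {
    echo "The tidy extension is not loaded.\n";
}
```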
I've found PHP Simple HTML DOM to be the most useful and straightforward library yet. Better than PECL, I would say.
I've written an article on how to use it to scrape myspace artist tour dates (just an example.) Here's a link to the php simple html dom parser.
The DOM library, which is now built in, can solve this problem easily. The loadHTML method will accept malformed markup, while the load method will not.
$d = new DOMDocument;
$d->loadHTML('<div>This is some <b>text');
echo $d->saveHTML();
The output will be:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div>This is some <b>text</b></div>
</body>
</html>
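Once the fragment is loaded, the same DOMDocument object can be manipulated with methods that mirror the JavaScript DOM API. A small sketch under that assumption; the class name and the appended paragraph are made up for illustration.

```php
<?php
$previous = libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadHTML('<div>This is some <b>text');
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Same method names as in browser JavaScript.
$div = $d->getElementsByTagName('div')->item(0);
$div->setAttribute('class', 'note');
$p = $d->createElement('p', 'Appended paragraph');
$div->appendChild($p);

// Passing a node to saveHTML() serializes just that subtree.
$fragment = $d->saveHTML($div);
echo $fragment;
```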
For manipulating the DOM, I think that what you're looking for is this. I've used it to parse HTML documents from the web and it worked fine for me.
