I want to get the <form> from the site, but in this case there is still a lot of other HTML code between the form tags. How do I remove it? I mean, how can I use PHP with a regular expression to match just the <form> and </form> part of the site?
$str = file_get_contents('http://bingphp.codeplex.com');
preg_match_all('~<form.+</form>~iUs', $str, $match);
var_dump($match);
You should not use regular expressions for extracting HTML content. Use a DOM parser.
E.g.
$doc = new DOMDocument();
$doc->loadHTMLFile("http://bingphp.codeplex.com");
$forms = $doc->getElementsByTagName('form');
Update: If you want to remove the forms (not sure if you meant that):
for ($i = $forms->length; $i--;) { // iterate backwards; the node list is live
$node = $forms->item($i);
$node->parentNode->removeChild($node);
}
Update 2:
I just noticed that they have one form that wraps the whole body content. So this way or another, you will get the whole page actually.
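To actually see the markup of each form found this way, something like the following should work. It is only a sketch: the inline $html string is a made-up stand-in for the downloaded page, and it relies on DOMDocument::saveHTML() accepting a node argument (PHP 5.3.6 or later).

```php
<?php
// Hypothetical sample page; replace with the downloaded HTML.
$html = '<html><body><h1>Title</h1>'
      . '<form action="/search"><input name="q"></form>'
      . '<p>text</p></body></html>';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Serialize every <form> element back to an HTML string.
$formsHtml = [];
foreach ($doc->getElementsByTagName('form') as $form) {
    $formsHtml[] = $doc->saveHTML($form);
}

foreach ($formsHtml as $formHtml) {
    echo $formHtml, "\n";
}
```

If one form wraps the whole body, as on the site in question, the serialized form will of course still contain nearly the whole page.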
The regex problem lies in the greediness. For such cases .+? is advisable.
But what @Felix said. While a regular expression is workable for HTML extraction, you are often looking for something specific and should thus rather parse it. It's also much simpler if you use QueryPath:
$str = file_get_contents('http://bingphp.codeplex.com');
print qp($str)->find("form")->html();
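To illustrate the greediness point made above, here is a small sketch on a made-up string (the $html value is an assumption, not the real page). Note that the pattern in the question already carries the U modifier, which inverts greediness, so adding ? there would make the quantifier greedy again; the sketch leaves U out.

```php
<?php
// Hypothetical input with two separate forms.
$html = '<form id="a">one</form><p>x</p><form id="b">two</form>';

// Greedy: .+ runs to the LAST </form>, so both forms end up in one match.
preg_match_all('~<form.+</form>~is', $html, $greedy);

// Lazy: .+? stops at the FIRST </form> after each opening tag.
preg_match_all('~<form.+?</form>~is', $html, $lazy);

echo count($greedy[0]), " greedy match(es)\n";
echo count($lazy[0]), " lazy match(es)\n";
```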
The best way I can think of is to use the Simple HTML DOM library with PHP to get the form(s) from the HTML page using DOM queries.
It is a little more convenient than using built-in XML parsers like SimpleXML or DOMDocument.
You can find the library here.
Normally you should use DOM to parse HTML, but in this case the web site is very far from being standard HTML, with some of the code being modified in place by javascript. It can therefore not be loaded into the DOM object. This might be intentional, a way of obfuscating the code.
In any case, it is not so much your RE (although using a non-greedy match would help), but the design of the site itself which is preventing you from parsing out what you want.
I want to be able to strip inline css {} blocks from HTML using preg_replace. Anyone know the regex for that?
UPDATE
I won't be controlling the pages. I want to strip all markup from a page and just leave the content.
There is a great 3rd-party library that makes simple DOM manipulations like these really easy.
$html = new simple_html_dom();
$html->load($inputString);
foreach($html->find('style') as $style)
$style->outertext = '';
$outputString = $html->save();
If you cannot use 3rd-party libraries for some reason, using PHP's built-in DOM module is still a better option than regex.
If you want to keep the tags but only remove their contents for some reason use innertext instead of outertext.
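For the built-in DOM route mentioned above, a minimal sketch could look like this. The $html string is invented for illustration, and it assumes the CSS lives in <style> elements.

```php
<?php
// Hypothetical input page.
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><p>Hello</p></body></html>';

libxml_use_internal_errors(true); // keep parser warnings quiet

$doc = new DOMDocument();
$doc->loadHTML($html);

// The node list is live, so walk it backwards while removing nodes.
$styles = $doc->getElementsByTagName('style');
for ($i = $styles->length - 1; $i >= 0; $i--) {
    $style = $styles->item($i);
    $style->parentNode->removeChild($style);
}

$out = $doc->saveHTML();
echo $out;
```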
For stripping inline css, this method seems rather odd to me. Why don't you approach this using javascript or even jQuery?
Just invoke removeAttr with jQuery.
removeAttr | jQuery API
First, regexes are not the way to parse HTML. If you actually want to parse HTML, and can't use an existing solution, then use the DOM module in PHP. http://php.net/manual/en/book.dom.php
Fortunately, PHP already has a function that will strip tags from a block of HTML. It is called strip_tags(). http://php.net/manual/en/function.strip-tags.php
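A quick sketch of strip_tags() on a made-up fragment:

```php
<?php
// Hypothetical input fragment.
$html = '<p>Hello <b>world</b></p>';

// strip_tags() drops the tags and keeps the text between them.
$text = strip_tags($html);

echo $text, "\n";
```

Keep in mind that strip_tags() only removes the tags themselves, so the CSS text inside a style element would be left behind as plain text.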
I'm grabbing data from a published google spreadsheet, and all I want is the information inside of the content div (<div id="content">...</div>)
I know that the content starts off as <div id="content"> and ends as </div><div id="footer">
What's the best / most efficient way to grab the part of the DOM that is inside there? I was thinking of a regular expression (see my example below), but it is not working and I'm not sure if it is that efficient...
header('Content-type: text/plain');
$foo = file_get_contents('https://docs.google.com/spreadsheet/pub?key=0Ahuij-1M3dgvdG8waTB0UWJDT3NsUEdqNVJTWXJNaFE&single=true&gid=0&output=html&ndplr=1');
$start = '<div id="content">';
$end = '<div id="footer">';
$foo = preg_replace("#$start(.*?)$end#",'$1',$foo);
echo $foo;
UPDATE
I guess another question I have is basically about if it is just simpler and easier to use regex with start and end points rather than trying to parse through a DOM which might have errors and then extract the piece I need. Seems like regex would be the way to go but would love to hear your opinions.
Try changing your regex to $foo = preg_replace("#$start(.*?)$end#s",'$1',$foo); the s modifier makes the . match newlines as well. As it is, your regex would need all the content between the tags to be on the same line in order to match.
If your HTML page is any more complex than that, then regex probably won't cut it and you'd need to look into a parser like DOMDocument or Simple HTML DOM
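Here is a small sketch of the difference the s modifier makes. The $html string is a made-up stand-in for the spreadsheet page, and preg_quote() is added as a precaution so the literal markers cannot be read as regex syntax.

```php
<?php
// Hypothetical multi-line input.
$html = "<div id=\"content\">\nline one\nline two\n</div><div id=\"footer\">f</div>";

$start = preg_quote('<div id="content">', '#');
$end   = preg_quote('</div><div id="footer">', '#');

// Without s, the dot cannot cross a newline, so nothing matches.
$withoutS = preg_match("#$start(.*?)$end#", $html, $m1);

// With s, the dot matches newlines too and the content is captured.
$withS = preg_match("#$start(.*?)$end#s", $html, $m2);

echo "without s: $withoutS, with s: $withS\n";
```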
If you have a lot to do, I would recommend you take a look at http://simplehtmldom.sourceforge.net
It is really good for this sort of thing.
Do not use regex, it can fail.
Use PHP's built-in DOM parser:
http://php.net/manual/en/class.domdocument.php
You can easily traverse and parse the relevant content.
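As a rough sketch of what that traversal could look like for the content div, using DOMXPath (the $html string is an invented stand-in for the fetched page):

```php
<?php
// Hypothetical input page.
$html = '<div id="header">h</div>'
      . '<div id="content"><p>Row 1</p><p>Row 2</p></div>'
      . '<div id="footer">f</div>';

libxml_use_internal_errors(true); // tolerate sloppy markup quietly

$doc = new DOMDocument();
$doc->loadHTML($html);

// Select the div by its id attribute.
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="content"]')->item(0);

// textContent concatenates the text of all descendants.
$text = $node !== null ? trim($node->textContent) : '';

echo $text, "\n";
```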
This is driving me nuts! A little piece of code that I can't seem to debug :( Basically I have an HTML file in a string and I want to find X inside until another X (same value) IF there is another one, if there isn't, then grab X until end of file.
The code that doesn't work:
$contents = '< div id="main" class="clearfix"> < div id="col-1">< div id="content">< div id="p19601634">< h1>< span id="ppt19601634">';
$regex = '!<div id="content">(.*?)(?:<div id="content">)!s';
preg_match_all($regex, $contents, $matches);
Please notice that I added spaces before the DIV for display purpose and that I want to check with NEW LINES and TABS inside the HTML also (basically, there is a line return after the first DIV).
Right now, my code works if it finds many occurrences of my search, and it returns the matches. But if there is only one item found, it doesn't work.
Does anyone know how to do this?
Thanks a bunch
Regular expressions are not and never will be the right tool for this job. "I have to use regular expressions" is not true. There is computer science theory to explain this: regular expressions are only capable of matching regular languages, but HTML (or XML) is a more sophisticated language than that.
Another solution for you besides DOM mentioned in @meder's answer is XSLTProcessor. XSLT is a declarative pattern-matching language like regular expressions. But XSLT is capable of matching the hierarchical structure of XHTML or XML.
See the answers in Simple XML parsing on PHP for more solutions, including an example of XSLTProcessor in my answer.
If you want to learn all about HTML scraping techniques in PHP, there's a book on the subject by Matthew Turland, titled php|architect's Guide to Web Scraping with PHP. It's available in digital form now, and should be in print soon.
If you can pry yourself away from PHP for a moment, try a package called Beautiful Soup. This package has one huge advantage: unlike DOM/XSLT parsers, Beautiful Soup doesn't choke if you direct it to parse an HTML page that has some bad markup. Since most web sites you will be scraping probably contain some mistakes, this is a pretty important advantage.
Use a DOM library and do something like..
$d = new DOMDocument();
$d->loadHTML($htmlString);
$content = $d->getElementById('content');
$inside = innerHTML( $content );
var_dump($inside);
function innerHTML($node){
$doc = new DOMDocument();
foreach ($node->childNodes as $child)
$doc->appendChild($doc->importNode($child, true));
return $doc->saveHTML();
}
I want to dynamically remove specific tags and their content from an html file and thought of using preg_replace but can't get the syntax right. Basically it should, for example, do something like :
Replace everything between (and including) "" by nothing.
Could anybody help me out on this please ?
Easy dude.
To make a regexp ungreedy, use the U modifier.
To make . match newlines as well (so a match can span several lines), use the s modifier.
Knowing that, to remove all paragraphs use this pattern:
#<p[^>]*>(.*)?</p>#sU
Explanation:
I use # as the delimiter so that I don't have to escape the / characters (which makes the pattern more readable)
<p[^>]*> : matches an opening paragraph tag (possibly with attributes, such as a style)
(.*)? : everything (in "ungreedy" mode)
</p> : obviously, the closing paragraph tag
Hope that helps!
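Applied with preg_replace, the pattern above could be used like this. The $html string is made up for the example. One caveat worth knowing: <p[^>]*> would also match other tags whose names start with p, such as <pre>.

```php
<?php
// Hypothetical input; the first paragraph spans two lines.
$html = "<div>keep</div><p class=\"x\">drop\nme</p><span>also keep</span><p>gone</p>";

// s lets . cross newlines, U makes the quantifiers ungreedy.
$result = preg_replace('#<p[^>]*>(.*)?</p>#sU', '', $html);

echo $result, "\n";
```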
If you are trying to sanitize your data, it is often recommended that you use a whitelist as opposed to blacklisting certain terms and tags. This is easier to sanitize and prevent XSS attacks. There's a well known library called HTML Purifier that, although large and somewhat slow, has amazing results regarding purifying your data.
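To show the whitelist idea in its simplest form, here is a sketch using the allowed-tags argument of the built-in strip_tags(). This is a stand-in for illustration, not HTML Purifier, and it is not a complete XSS defence: attributes on the allowed tags are kept as they are, and the text inside removed tags stays in the output.

```php
<?php
// Hypothetical untrusted input.
$dirty = '<p>Hi <script>alert(1)</script><b>there</b></p>';

// Only <p> and <b> are allowed; every other tag is removed.
$clean = strip_tags($dirty, '<p><b>');

echo $clean, "\n";
```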
I would suggest not trying to do this with a regular expression. A safer approach would be to use something like
Simple HTML DOM
Here is the link to the API Reference: Simple HTML DOM API Reference
Another option would be to use DOMDocument
The idea here is to use a real HTML parser to parse the data and then you can move/traverse through the tree and remove whichever elements/attributes/text you need to. This is a much cleaner approach than trying to use a regular expression to replace data within the HTML.
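<?php
$doc = new DOMDocument;
$doc->loadHTMLFile('blah.html');
// Find the first <table> anywhere in the document.
$table = $doc->getElementsByTagName('table')->item(0);
if ($table !== null) {
    // Remove it through its actual parent; a table is rarely a direct child of <html>.
    $table->parentNode->removeChild($table);
}
echo $doc->saveHTML();
?>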
If you don't know what is between the tags, Phill's response won't work.
This will work if there's no other tags in between, and is definitely the easier case. You can replace the div with whatever tag you need, obviously.
preg_replace('#<div>[^<]+</div>#','',$html);
If there could be other tags in the middle, this should work, but could cause problems. You're probably better off going with the DOM solution above, if so
preg_replace('#<div>.+</div>#','',$html);
These aren't tested
PSEUDO CODE
function replaceMe($html_you_want_to_replace, $html_dom) {
    // preg_quote() escapes regex metacharacters in the literal HTML
    return preg_replace('/^' . preg_quote($html_you_want_to_replace, '/') . '/', '', $html_dom);
}
HTML Before
<div>I'm Here</div><div>I'm next</div>
<?php
$html_dom = "<div>I'm Here</div><div>I'm next</div>";
$get_rid_of = "<div>I'm Here</div>";
echo replaceMe($get_rid_of, $html_dom);
?>
HTML After
<div>I'm next</div>
I know it's a hack job
I have a bunch of legacy documents that are HTML-like. As in, they look like HTML, but have additional made up tags that aren't a part of HTML
<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>
I need to parse these files. PHP is the only tool available. The documents don't come close to being well-formed XML.
My original thought was to use the loadHTML methods on PHP's DOMDocument. However, these methods choke on the made-up HTML tags, and will refuse to parse the string/file.
$oDom = new DomDocument();
$oDom->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
//gives us
DOMDocument::loadHTML() [function.loadHTML]: Tag pseud-template invalid in Entity, line: 1 occured in ....
The only solution I've been able to come up with is to pre-process the files with string replacement functions that will remove the invalid tags and replace them with a valid HTML tag (maybe a span with an id of the tag name).
Is there a more elegant solution? A way to let DOMDocument know about additional tags to consider as valid? Is there a different, robust HTML parsing class/object out there for PHP?
(if it's not obvious, I don't consider regular expressions a valid solution here)
Update: The information in the fake tags is part of the goal here, so something like Tidy isn't an option. Also, I'm after something that does some level, if not all, of the well-formedness cleanup for me, which is why I was looking at DOMDocument's loadHTML method in the first place.
You can suppress warnings with libxml_use_internal_errors, while loading the document. Eg.:
libxml_use_internal_errors(true);
$doc = new DomDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_use_internal_errors(false);
If, for some reason, you need access to the warnings, use libxml_get_errors
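Putting the two together, a sketch along these lines collects the warnings instead of printing them and then checks that the fake tag survived parsing. Which warnings show up, if any, depends on the installed libxml version.

```php
<?php
// Collect libxml errors internally; remember the previous setting.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>');

// Fetch whatever the parser complained about, then reset the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The unknown element is still in the tree.
$fake = $doc->getElementsByTagName('pseud-template');
echo $fake->length, " fake tag(s) found\n";
```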
I wonder if passing the "bad" HTML through HTML Tidy might help as a first pass? Might be worth a look, if you can get the document to be well formed, maybe you could load it as a regular XML file with DomDocument.
@Twan
You don't need a DTD for DOMDocument to parse custom XML. Just use DOMDocument->load(), and as long as the XML is well-formed, it can read it.
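A minimal sketch of that, using loadXML() (the string counterpart of load(), which reads from a file) and a made-up well-formed document with a single root element:

```php
<?php
// Hypothetical well-formed XML containing a custom element.
$xml = '<doc><strong>This is an example of a '
     . '<pseud-template>fake tag</pseud-template></strong></doc>';

$doc = new DOMDocument();
$ok  = $doc->loadXML($xml); // no DTD needed for custom element names

$text = $doc->getElementsByTagName('pseud-template')->item(0)->textContent;

echo $text, "\n";
```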
Once you get the files to be well-formed, that's when you can start looking at XML parsers; before that you're S.O.L. Like Alejo said, you could look at HTML Tidy, but it looks like that's specific to HTML, and I don't know how it would go with your custom elements.
I don't consider regular expressions a valid solution here
Until you've got well-formedness, that might be your only option. Once you get the documents to that stage, then you're in the clear with the DOM functions.
Take a look at the Parser in the PHP Fit port. The code is clean and was originally designed for loading the dirty HTML saved by Word. It's configured to pull tables out, but can easily be adapted.
You can see the source here:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/Parser.phps
The unit test will show you how to use it:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/test/parser.phps
My quick and dirty solution to this problem was to run a loop that matches my list of custom tags with a regular expression. The regexp doesn't catch tags that have another inner custom tag inside them.
When there is a match, a function to process that tag is called and returns the "processed HTML". If that custom tag was inside another custom tag, then the parent becomes childless by the fact that actual HTML was inserted in place of the child, and it will be matched by the regexp and processed at the next iteration of the loop.
The loop ends when there are no childless custom tags to be matched. Overall it's iterative (a while loop) and not recursive.
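A rough sketch of such a loop is below. The tag names, the input string and the span replacement are all made up for illustration; the real processing function would go inside the callback. It only handles tags without attributes.

```php
<?php
// Hypothetical list of custom tags and a nested input.
$customTags = ['pseud-template', 'pseud-box'];
$html = '<pseud-box>a <pseud-template>b</pseud-template> c</pseud-box>';

$alt = implode('|', array_map(function ($tag) {
    return preg_quote($tag, '#');
}, $customTags));

// Matches a custom tag whose content holds no other custom opening tag,
// i.e. a "childless" custom tag.
$pattern = '#<(' . $alt . ')>((?:(?!<(?:' . $alt . ')>).)*?)</\1>#s';

do {
    // Replace each childless custom tag with ordinary HTML.
    $html = preg_replace_callback($pattern, function ($m) {
        return '<span class="' . $m[1] . '">' . $m[2] . '</span>';
    }, $html, -1, $count);
} while ($count > 0); // stop when a pass finds nothing left to process

echo $html, "\n";
```

The inner tag is replaced on the first pass, which makes the outer tag childless so that the second pass can pick it up.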
@Alan Storm
Your comment on my other answer got me to thinking:
When you load an HTML file with DOMDocument, it appears to do some level of cleanup re: well-formedness, BUT requires all your tags to be legit HTML tags. I'm looking for something that does the former, but not the latter. (Alan Storm)
Run a regex (sorry!) over the tags, and when it finds one which isn't a valid HTML element, replace it with a valid element that you know doesn't exist in any of the documents (blink comes to mind...), and give it an attribute value with the name of the illegal element, so that you can switch it back afterwards. eg:
$code = str_replace("<pseudo-tag>", "<blink rel=\"pseudo-tag\">", $code);
// and then back again...
$code = preg_replace('#<blink rel="(.*?)">#', '<\1>', $code); // explicit # delimiters; closing tags are not handled here
obviously that code won't work, but you get the general idea?