I would like to extract the href value (without any library), how can I do it?
<dm:link rel="uql" href="URL-URL-URL-URL" type="application/rss+xml"/>
Thanks.
You're usually better off parsing it using the likes of simplexml or dom libraries. Using a regex for this is bug-prone.
Though arguably a DOM parser would be the best solution, this small task could be done quite reliably with a regex.
Also, if you use a library, you'd need to declare the dm namespace before it can parse the element.
preg_match('/\shref="(?<href>[^"]+)"/', $str, $match);
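For reference, here is a minimal sketch of how the named group in that pattern can be read back. The `$str` value is just the element from the question; nothing else is assumed.

```php
<?php
// Sample input taken from the question.
$str = '<dm:link rel="uql" href="URL-URL-URL-URL" type="application/rss+xml"/>';

// \s before href avoids matching attributes that merely end in "href".
// (?<href>...) is a named group, so the value is available as $match['href'].
if (preg_match('/\shref="(?<href>[^"]+)"/', $str, $match)) {
    $href = $match['href'];
    echo $href, "\n";
} else {
    $href = null;
    echo "No href attribute found\n";
}
```

The pattern only handles double-quoted attribute values; single-quoted or unquoted values would need a different expression.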
Try this:
(?<=href\=\")[\w:\-\=\/\:\d\?\.\#]*
It should work.
Are there any HTML parsers written in PHP that use DOMDocument for parsing?
I'm basically looking for a wrapper class that provides nicer and more natural API than DOMDocument, which is problematic to work with.
There is SmartDOMDocument, which fixes a few things such as encoding and outputting as a string.
I don't know of any other wrappers, but you can use an alternative to DOMDocument:
PHPQuery
PHP Simple HTML DOM Parser
Ganon
Also, do you realize DOMXPath exists?
It makes it way easier to retrieve values.
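A rough sketch of what DOMXPath looks like in practice, assuming a small made-up HTML string (the markup and the variable names are illustrative only):

```php
<?php
// Hypothetical sample markup, not taken from any real page.
$html = '<html><body><a href="/one">One</a><a href="/two">Two</a></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

// One XPath expression selects every href attribute of every <a> element.
$xpath = new DOMXPath($doc);
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

The same query written with plain DOMDocument calls would need a loop over getElementsByTagName('a') plus a getAttribute() call per node.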
http://www.phpbuilder.com/columns/PHP_HTML_DOM_parser/PHPHTMLDOMParser.cc_09-07-2011.php3 is another possibility.
I want to grab any data between these two div headers, and the code below should work, is there something I am not seeing?
preg_match_all('$\<div class\=\"productDescriptionWrapper\"\>(.*?)\<div class\=\"emptyClear\"\>$', $source, $match);
Thanks in advance!
Cory, typically you should use DOMDocument for this. Using regex to parse HTML is not considered good practice because it has many hidden pitfalls and overcomplicates the task.
http://php.net/manual/en/class.domdocument.php
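As a sketch of the DOMDocument route, assuming the description div looks roughly like the sample below (the real page markup is not shown in the question, so `$source` here is invented):

```php
<?php
// Assumed structure: the wrapper div contains the text and the emptyClear div.
$source = '<div class="productDescriptionWrapper">
  A sturdy widget.
  <div class="emptyClear"></div>
</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML often triggers warnings
$doc->loadHTML($source);
libxml_clear_errors();

// Select every div whose class attribute is exactly productDescriptionWrapper.
$xpath = new DOMXPath($doc);
$descriptions = [];
foreach ($xpath->query('//div[@class="productDescriptionWrapper"]') as $node) {
    $descriptions[] = trim($node->textContent);
}

print_r($descriptions);
```

If the div carries more than one class, the exact-match test would miss it; `contains(@class, "productDescriptionWrapper")` is a looser alternative.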
I am really new to php so still getting to grips.
I am using this bit of code to pull in world market feed.
<?php
$homepage = file_get_contents('http://www.news4trader.com/cgi-bin/google_finance.cgi?widget=worldmarkets');
echo $homepage;
?>
I just wanted to know how I can strip the google links out of it so the market titles are just static text.
All help is very much appreciated.
You can use the PHP function strip_tags() like this:
<?php
$homepage = file_get_contents('http://www.news4trader.com/cgi-bin/google_finance.cgi?widget=worldmarkets');
echo strip_tags($homepage, "<style><div><table><tr><td>");
?>
Just include all the tags you want to allow in the second argument.
You can use preg_replace() with a regex pattern to filter it out. This is simple, but not very flexible if you want to do more with your loaded data. PHP provides a nice library called DOMDocument (http://php.net/manual/de/class.domdocument.php), which lets you work with your document very flexibly.
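A possible DOMDocument version, sketched against an invented fragment because the real feed markup is not shown. It assumes the Google links can be recognised by "google" appearing in the href:

```php
<?php
// Hypothetical stand-in for the downloaded feed.
$homepage = '<table><tr><td><a href="http://www.google.com/finance?q=DJI">Dow Jones</a></td>'
          . '<td>12,000</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($homepage);
libxml_clear_errors();

// Replace each matching <a> element with a plain text node holding its label.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a[contains(@href, "google")]') as $link) {
    $text = $doc->createTextNode($link->textContent);
    $link->parentNode->replaceChild($text, $link);
}

$result = $doc->saveHTML();
echo $result;
```

Note that saveHTML() returns a full document, so the output gains doctype, html and body wrappers around the original fragment.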
You could use the DOMDocument class; it is meant for exactly that.
http://php.net/manual/en/class.domdocument.php
You should have a basic understanding of OOP to use it.
If you struggle with it, you could use strpos, substr and similar functions, but that would be hard.
strpos: http://php.net/manual/en/function.strpos.php
substr: http://php.net/manual/en/function.substr.php
you can use regex something like this:
/<a (.+google.+)>.+<\/a>/
This matches a link that has any attribute or value containing the word google.
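To actually strip the links while keeping their text, the pattern can be passed to preg_replace(). The sketch below uses a tightened variant of the pattern (my own, not the one above): `[^>]*` keeps the match inside a single tag and `.*?` stops at the first closing tag. The sample HTML is made up.

```php
<?php
// Hypothetical input: one Google link and one unrelated link.
$html = '<td><a href="http://www.google.com/finance?q=DJI">Dow Jones</a></td>'
      . '<td><a href="/local">Local</a></td>';

// $1 is the link text captured between the opening and closing tag.
$result = preg_replace('/<a\s[^>]*google[^>]*>(.*?)<\/a>/is', '$1', $html);

echo $result, "\n";
```

Only the link whose opening tag contains "google" is unwrapped; the other link is left alone.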
I want to get the content between a code tag in a html document.
I tried forming it in preg_match...
Could anybody help me..
If you want to use preg_match, do:
preg_match("/<code>(.+?)<\/code>/is", $content, $matches);
Then access it with
$matches[1]
Though in general, you will get more use and better performance from an HTML parser, which is preferred over regular expressions.
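For comparison, a parser-based sketch using the built-in DOMDocument class. The `$content` string is a made-up example.

```php
<?php
// Hypothetical document containing a single <code> element.
$content = '<p>Intro</p><code>echo "hi";</code>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();

// Collect the text of every <code> element in document order.
$snippets = [];
foreach ($doc->getElementsByTagName('code') as $node) {
    $snippets[] = $node->textContent;
}

print_r($snippets);
```

Unlike the regex, this also copes with attributes on the tag, such as `<code class="php">`.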
It's easier if you use phpQuery or QueryPath which allow:
print qp($html)->find("code")->text();
// looks for the <code> tag and prints the text content
If you want to try regular expressions for this, check out some of the tools listed in https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world for help.
I want to get the <form> from the site, but there is still a lot of other HTML code inside the form part. How can I remove it? I mean, how can I use PHP to match just the <form> and </form> part of the site?
$str = file_get_contents('http://bingphp.codeplex.com');
preg_match_all('~<form.+</form>~iUs', $str, $match);
var_dump($match);
You should not use regular expressions for extracting HTML content. Use a DOM parser.
E.g.
$doc = new DOMDocument();
$doc->loadHTMLFile("http://bingphp.codeplex.com");
$forms = $doc->getElementsByTagName('form');
Update: If you want to remove the forms (not sure if you meant that):
for ($i = $forms->length; $i--;) {
$node = $forms->item($i);
$node->parentNode->removeChild($node);
}
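Here is a self-contained sketch of the removal step, run against an inline string instead of the live site (the markup is invented). The list returned by getElementsByTagName() is live, which is why the loop walks it from the end.

```php
<?php
// Hypothetical page with two forms and one paragraph to keep.
$html = '<html><body><form id="a"><input name="q"></form>'
      . '<p>Keep me</p><form id="b"></form></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$forms = $doc->getElementsByTagName('form');

// Iterate backwards: removing a node shrinks the live list, so a forward
// loop would skip elements.
for ($i = $forms->length - 1; $i >= 0; $i--) {
    $node = $forms->item($i);
    $node->parentNode->removeChild($node);
}

$result = $doc->saveHTML();
echo $result;
```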
Update 2:
I just noticed that they have one form that wraps the whole body content. So this way or another, you will get the whole page actually.
The regex problem lies in the greediness. For such cases .+? is advisable.
But see what @Felix said. While a regular expression is workable for HTML extraction, you are often looking for something specific, and should therefore rather parse it. It's also much simpler if you use QueryPath:
$str = file_get_contents('http://bingphp.codeplex.com');
print qp($str)->find("form")->html();
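The greedy versus non-greedy difference mentioned above can be seen in a small sketch with an invented string. One caveat that is easy to miss: the U modifier used in the question already inverts greediness, so `.+` under U is lazy and `.+?` under U would become greedy. The patterns below leave U out.

```php
<?php
// Hypothetical input with two separate forms.
$html = '<form id="a">one</form><p>between</p><form id="b">two</form>';

// Greedy: .+ runs to the LAST </form>, so both forms land in one match.
preg_match_all('~<form.+</form>~is', $html, $greedy);

// Lazy: .+? stops at the FIRST </form>, giving one match per form.
preg_match_all('~<form.+?</form>~is', $html, $lazy);

echo count($greedy[0]), " greedy match(es)\n";
echo count($lazy[0]), " lazy match(es)\n";
```

Neither form handles nested forms or a form that wraps the whole page, which is where a parser is the safer choice.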
The best way I can think of is to use the Simple HTML DOM library with PHP to get the form(s) from the HTML page using DOM queries.
It is a little more convenient than using built-in XML parsers like SimpleXML or DOMDocument.
You can find the library here.
Normally you should use DOM to parse HTML, but in this case the web site is very far from being standard HTML, with some of the code being modified in place by javascript. It can therefore not be loaded into the DOM object. This might be intentional, a way of obfuscating the code.
In any case, it is not so much your RE (although using a non-greedy match would help), but the design of the site itself which is preventing you from parsing out what you want.