PHP, preg_match, Regular Expression. What am I doing wrong? - php

Here is the pattern that I want to match:
<div class="class">
I want to be able to capture this text
<span class="ptBrand">
This is what I am doing:
$pattern='{<div class="productTitle">[\n]<((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)>([^\n]*)</a>[\n]<span class="ptBrand">}';
preg_match($pattern, $data, $matches,PREG_OFFSET_CAPTURE);
print_r($matches);
It prints:
Array ( )

As a general rule, regular expressions are a really poor means of parsing HTML. They're unreliable and tend to end up being really complicated. A far more robust solution is to use an HTML parser. See Parse HTML With PHP And DOM.
As for your expression, I don't see <div class="productTitle" anywhere in the source so I'd start there. Likewise you're trying to parse a URL but there's no mention of the anchor tag (either directly or through a sufficient wildcard) so it'll fail there too. Basically that expression doesn't look anything like the HTML you're trying to parse.
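To make the parser suggestion concrete, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath. The markup in $data is an assumption (the full HTML was not shown in the question), so the XPath query would need adjusting to the real structure.

```php
<?php
// Hypothetical markup: the question's real HTML was not shown in full,
// so this structure (div.productTitle > a, then span.ptBrand) is assumed.
$data = '<div class="productTitle">
<a href="http://example.com/item">I want to be able to capture this text</a>
<span class="ptBrand">Brand</span>
</div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for sloppy markup.
libxml_use_internal_errors(true);
$doc->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Anchors that are direct children of the productTitle div.
$links = $xpath->query('//div[@class="productTitle"]/a');

$title = $links->length > 0 ? trim($links->item(0)->textContent) : null;
$href  = $links->length > 0 ? $links->item(0)->getAttribute('href') : null;

echo $title, "\n";
```

This way neither the URL nor the whitespace between the tags has to be described in the pattern.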

... Or this:
preg_match('/\s*([^>]+)\s*<\/a/',$string,$match);
Trims it too.

The pattern:
/<div class="class">\s*([^<]+)/m
Would get the link and text roughly, but using the DOM library would be a much better method.

You can try this:
([\s\S]*?)

Related

Regular Expression using Preg_Match

I'm using PHP's preg_match function...
How can I fetch the text between tags? The following attempt doesn't fetch the value: preg_match("/^<title>(.*)<\/title>$/", $originalHTMLBlock, $textFound);
How can I find the first occurrence of the following element and fetch its contents (Bunch of Texts and Tags)?
<div id="post_message_">Bunch of Texts and Tags</div>
This is starting to get boring. Regex is likely not the tool of choice for matching languages like HTML, and there are thousands of similar questions on this site to prove it. I'm not going to link to the answer everyone else always links to - do a little search and see for yourself.
That said, your first regex assumes that the <title> tag is the entire input. I suspect that that's not the case. So
preg_match("#<title>(.*?)</title>#", $originalHTMLBlock, $textFound);
has a bit more of a chance of working. Note the lazy quantifier which becomes important if there is more than one <title> tag in your input. Which might be unlikely for <title> but not for <div>.
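A small sketch of the greedy versus lazy difference, using a made-up one-line input with two <div> elements (the input string is an assumption, not the asker's data):

```php
<?php
// Hypothetical input with two <div> elements on one line.
$html = '<div>first</div><div>second</div>';

// Greedy: .* runs to the LAST </div>, swallowing the markup in between.
preg_match('#<div>(.*)</div>#', $html, $greedy);

// Lazy: .*? stops at the FIRST </div>.
preg_match('#<div>(.*?)</div>#', $html, $lazy);

echo $greedy[1], "\n"; // first</div><div>second
echo $lazy[1], "\n";   // first
```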
For your second question, you only have a working chance with regex if you don't have any nested <div> tags inside the one you're looking for. If that's the case, then
preg_match("#<div id=\"post_message_\">(.*?)</div>#", $originalHTMLBlock, $textFound);
might work.
But all in all, you'd better be using an HTML parser.
Use this: <title\b[^>]*>(.*?)</title> (are you sure you need ^ and $?)
You can use the same kind of expression, <div\b[^>]*>(.*?)</div>, assuming you don't have a </div> tag in your Bunch of Texts and Tags text. If you do, maybe you should take a look at http://code.google.com/p/phpquery/

Select the first paragraph tag not contained in within another tag using RegEx (Perl-style)

I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text and HTML was the simplest case I could think of to explain. My actual case looks something a little more like this (But a lot more data and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows across a ton of rows similar to that. Help with my first question would help with my actual project.
Your regex won't work. Even if you had only non-nested paragraphs, your capturing parentheses would match "First, non-nested ... Last paragraph.".
Try:
<([^>]+)>([^<]*<(?!/?\1)[^<]*)*<\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.
How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).
Use a three-step process. First, pray that everything is well formed. Second, remove everything that is nested.
s{<div>.*?</div>}{}gs; # HTML example (the s flag lets . match newlines)
s/#.*?#//g; # 2nd example
Then get your result. Everything that is left is now not nested.
($result) = m{<p>(.*?)</p>}; # HTML example (list context returns the capture)
($result) = m{\[(.*?)\]}; # 2nd example
(this is Perl. Don't know how different it would look in PHP).
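For what it's worth, a rough PHP equivalent of the Perl above might look like the following. It is only a sketch: the variable names are mine, the second input is a shortened copy of the sample rows from the question, and it still assumes everything is well formed.

```php
<?php
// The HTML block from the question.
$html = "<div>\n<p>First, nested paragraph</p>\n</div>\n"
      . "<p>First, non-nested paragraph.</p>\n"
      . "<p>Second paragraph.</p>\n<p>Last paragraph.</p>";

// Remove every <div>...</div>; the s modifier lets . cross newlines.
$flat = preg_replace('#<div>.*?</div>#s', '', $html);

// The first <p> that remains is not nested.
$result = preg_match('#<p>(.*?)</p>#s', $flat, $m) ? $m[1] : null;
echo $result, "\n"; // First, non-nested paragraph.

// Second example: rows wrapped in #...# play the role of the nested part.
$data = "#[BILLINGCODE|12345|11|15|2001|15|26|50]#\n"
      . "[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]\n"
      . "#[{{Additional Details }}]#\n"
      . "[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]";

$flat2   = preg_replace('/#.*?#/', '', $data);
$result2 = preg_match('/\[(.*?)\]/', $flat2, $m2) ? $m2[1] : null;
echo $result2, "\n"; // ITEM1|{{Escaped Description}}|1|1|4031|NONE|15
```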
"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode);
// output '<p>is valid</p>'
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and regular expressions can only describe regular languages), you can't parse out arbitrary chunks of HTML using regex. Use an HTML parser; it'll get the job done considerably more smoothly than trying to hack together some regex.

Get content between code tag return in array

I want to get the content between a code tag in a html document.
I tried forming it in preg_match...
Could anybody help me..
If you want to use preg_match, do:
preg_match("/<code>(.+?)<\/code>/is", $content, $matches);
Then access it with
$matches[1]
In general, though, you will find an HTML parser more useful and more reliable, and it is preferred over regular expressions for this kind of task.
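Since the question asks for the results in an array, preg_match_all may be closer to what is wanted than preg_match, which stops at the first match. A minimal sketch, with a made-up $content string:

```php
<?php
// Hypothetical document with two <code> elements.
$content = '<p>Use <code>preg_match()</code> or <code>preg_match_all()</code>.</p>';

// i: case-insensitive tag name; s: let . match newlines inside a block.
preg_match_all('/<code>(.+?)<\/code>/is', $content, $matches);

// $matches[1] holds the captured text of every match, in document order.
print_r($matches[1]);
```

Here $matches[0] would hold the full matches including the tags, and $matches[1] only the text between them.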
It's easier if you use phpQuery or QueryPath which allow:
print qp($html)->find("code")->text();
// looks for the <code> tag and prints the text content
If you want to try regular expressions for this, check out some of the tools listed in https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world for help.

Regex: Match html tag only if it contains a specific class id

Match an html tag using perl regex in php.
Want the tag to match if it contains "class=details" somewhere in the open tag.
Wanting to match <table border="0" class="details"> not <table border="0">
Wrote this to match it:
'#<table(.+?)class="details"(.+?)>#is'
The <table(.+?) creates a problem since it matches the first table tag it finds only stopping the match when it finds class="details" no matter how far down the code it occurs.
I think this logic would fix my problem:
"Match <table but only if it contains class="details" before the next >"
How can I write this?
While regular expressions can be good for a large variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of your document is so variable that it is hard to extract a tag accurately (and by accurately I mean a 100% success rate with no false positives).
What I recommend you do is use a DOM parser such as phpQuery and use it as such:
function get_first_image($html) {
    $dom = phpQuery::newDocument($html);
    $first_img = $dom->find('img:first');
    if ($first_img !== null) {
        return $first_img->attr('src');
    }
    return null;
}
Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can also get the alt attribute.
A regular expression could be devised to achieve the same goal, but it would be limited in such a way that it would force the alt attribute to come after the src or the opposite, and overcoming this limitation would add more complexity to the regular expression.
Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:
<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>
And then again, the above can fail if:
The attribute or tag name is in capitals and the i modifier is not used.
Quotes are not used around the src attribute.
An attribute other than src uses the > character somewhere in its value.
Some other reason I have not foreseen.
So again, simply don't use regular expressions to parse a dom document.
Simple example on how to solve your problem with phpQuery:
$dom = phpQuery::newDocument($html);
$matching_tags = $dom->find('.details');
You will probably need a Positive Look Ahead of some form, as a very crude one that clearly has its limitations...
<table(?=[^>]*class="details")[^>]*>
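A quick check of that look-ahead in PHP, against a made-up string with one plain table and one with class="details" (the input and the i modifier are my additions):

```php
<?php
// Hypothetical input: only the second table carries class="details".
$html = '<table border="0"><tr><td>a</td></tr></table>'
      . '<table border="0" class="details"><tr><td>b</td></tr></table>';

// The look-ahead cannot run past the first >, so it only inspects the
// attributes of the tag currently being matched.
preg_match_all('#<table(?=[^>]*class="details")[^>]*>#i', $html, $m);

print_r($m[0]);
```

Only the second opening tag is matched. The limitations mentioned still apply: single quotes, extra class names, or a > inside an attribute value would defeat it.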
HTML is not reliably parseable using regular expressions. There are a few simple cases which have a solution, but they are exceptions. I think that your case is unsolvable using regex, but I am not sure.
You should work with it using XML tools such as an XML parser and XPath for searching and testing your conditions. It is very simple to write an expression which matches your case. I don't know how to build the XML tree and execute an XPath query in PHP, but the XPath expression is
//table[@class='details']
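In PHP, running an XPath query like this could look roughly as follows with the built-in DOMDocument and DOMXPath classes. The sample markup is made up for illustration.

```php
<?php
// Hypothetical page with one plain table and one with class="details".
$html = '<html><body>'
      . '<table border="0"><tr><td>plain</td></tr></table>'
      . '<table border="0" class="details"><tr><td>wanted</td></tr></table>'
      . '</body></html>';

$doc = new DOMDocument();
// Keep libxml warnings about imperfect markup out of the output.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$tables = $xpath->query("//table[@class='details']");

foreach ($tables as $table) {
    echo trim($table->textContent), "\n";
}
```

Note that @class='details' compares the whole attribute value, so a table with class="details wide" would need something like contains(@class, 'details') instead.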
You could possibly use a Regex like the following:
<\/?table[^>]*(class="details")*>
But the above users are correct in saying that it would be much better to use a xml/html type parser to find your item.

regex: match string only if not part of a tag

I am trying to match a string only if it is not part of an html tag.
For example when searching for the string: "abc".
abc def should match
<p> foo bar foo abc foo bar</p> should match
but
foo should not match.
Thanks for the help!
I really wouldn't use regexps to match HTML, since HTML isn't regular and there are a load of edge cases to trip you up. For all but the simplest cases I'd use an HTML parser (e.g. this one for PHP).
Brian has got a point. Anyway, if you wish to use a regex, this one suits your inputs:
.*>[^<]*abc[^<]*<.*
I'm quite convinced that any regex is going to break on some CDATA sections.
What you're looking for is a DOM parser. That will strip out all the HTML and give you the plain text of the page you're examining, which you can then match on. Not sure what your use case is, but I'm assuming you're not manipulating the DOM, or else you'd be using JavaScript.
If you're just extracting information, parse the page using something like The Simple HTML DOM Parser, and then match against the plain text you can get from the parsed object.
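As a sketch of that idea with PHP's built-in DOMDocument rather than Simple HTML DOM: the helper name textContains is hypothetical, and the second input is my guess at the kind of case that should not match, with the string appearing only in an attribute.

```php
<?php
// Hypothetical helper: true if $needle occurs in the text of $html,
// ignoring tag names and attribute values.
function textContains(string $html, string $needle): bool
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $root = $doc->documentElement;
    return $root !== null && strpos($root->textContent, $needle) !== false;
}

var_dump(textContains('<p> foo bar foo abc foo bar</p>', 'abc')); // bool(true)
var_dump(textContains('<a href="abc">foo</a>', 'abc'));           // bool(false)
```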
While I too agree with Brian's comment, I often do quick and dirty parsing with regular expressions, and for your case, I'd use something like this:
"serialize" the data
s/[\r\n]//g
s/<!\[CDATA\[.*?]]>//g
s/</\n</g
s/>/>\n/g
then simply filter all lines that begin with <
s/^<.*//gm
What you're left with is just the text (and possibly a lot of white-space). Though this is less about regular expressions and more about search and replace.
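The same quick and dirty steps could be written in PHP roughly like this. The input string is made up, and this inherits all the fragility of the substitutions above.

```php
<?php
// Hypothetical input: "abc" appears once in an attribute and once as text.
$html = "<p>foo <a href=\"abc\">bar</a>\nabc</p>";

// "Serialize": drop line breaks and CDATA sections, then put every tag
// on a line of its own.
$s = preg_replace('/[\r\n]/', '', $html);
$s = preg_replace('/<!\[CDATA\[.*?\]\]>/', '', $s);
$s = preg_replace('/</', "\n<", $s);
$s = preg_replace('/>/', ">\n", $s);

// Blank out every line that begins with < (m makes ^ and $ match per line).
$s = preg_replace('/^<.*$/m', '', $s);

// Keep the non-empty lines: these are the text fragments.
$lines = array_values(array_filter(array_map('trim', explode("\n", $s)), 'strlen'));

print_r($lines);
```

Matching "abc" against $lines then only sees the text, not the href value.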
