I am trying to match a pattern so that I can retrieve a string from a website. Here is the string in question:
<a title="Posts by ivek dhwWaVa"
href="http://www.example.com/author/ivek/"
rel="nofollow">ivek</a>
I am trying to match the string "ivek" inside the a tag, and I want to do this for each post and relate it to the number of comments.
Firstly, what regex should I use for the above, so I can use it as an example for the rest? I have nothing so far:
$content = file_get_contents('http://www.example.com');
preg_match_all("", $content, $matches);
And how would I relate the comments to the author's name, given that there are many other authors on the website, each with their own set of comments? Do I use divs to break this up? Each set of info is wrapped in this div:
<div id="post-54" class="excerpt">
Thanks all for any help!
Please let me be the first to introduce you to the most famous answer on Stack Overflow.
Regular expressions are not suited to parsing HTML. You really need an HTML parser, even for what might appear to be a simple task.
I recommend something like PHP Simple HTML DOM Parser.
You really shouldn't be looking to Regex to do the job:
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
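For the author question above, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath. The sample markup is modeled on the snippets in the question. The second post, and the idea that author links can be recognised by "/author/" in their href, are assumptions for illustration. The question never shows the comment markup, so this only ties each author to its post div; a comment count could be looked up inside the same loop once its markup is known.

```php
<?php
// Sample markup modeled on the question; the second post is made up for illustration.
$html = <<<HTML
<div id="post-54" class="excerpt">
<a title="Posts by ivek dhwWaVa" href="http://www.example.com/author/ivek/" rel="nofollow">ivek</a>
</div>
<div id="post-55" class="excerpt">
<a title="Posts by anna" href="http://www.example.com/author/anna/" rel="nofollow">anna</a>
</div>
HTML;

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$authors = [];
// One iteration per post wrapper, so each author stays tied to its own post.
foreach ($xpath->query('//div[@class="excerpt"]') as $post) {
    // Assumed convention: the author link is the first <a> whose href contains "/author/".
    $link = $xpath->query('.//a[contains(@href, "/author/")]', $post)->item(0);
    if ($link !== null) {
        $authors[$post->getAttribute('id')] = trim($link->textContent);
    }
}

print_r($authors);
```

Running the second query relative to each post div is what keeps one post's data from being mixed up with another's.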
I can't seem to get this to work and I was hoping for some help.
I'm trying to capture the contents of a specific div (please save the DOM talk, for this specific purpose it doesn't really come into play.)
The problem is, I can't seem to get it to work if there is another div with attributes before it on the same line. I tried specifying only match if there's no > between <div and class="myClass", but I think I'm doing it wrong.
I'm still pretty mystified by regex.
/<div(?!>).*?class="myClass".*?>(.*?)<\/div>/mi
(semi) Working example: http://regex101.com/r/cW0lW6
Try
/<div(?=\s)(?:(?!>).)+?class="myClass".*?>(.*?)<\/div>/si
You can't parse [X]HTML with regex. Because HTML can't be parsed by
regex. Regex is not a tool that can be used to correctly parse HTML.
See: RegEx match open tags except XHTML self-contained tags
I suggest using QueryPath for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.
You can use this (simple way):
~<div[^>]+?class="myClass"[^>]*>(.*?)</div>~si
or this (more efficient way if you have a lot of attributes):
~<div(?>[^>c]++|\Bc|c(?!lass=))+class="myClass"[^>]*+>(.*?)</div>~si
Note that these patterns don't work if your div tag contains another div tag.
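If nested divs are a possibility, a parser sidesteps that limitation. The following is only a sketch: the input string is made up, and it assumes the class attribute is exactly "myClass" with no other class names alongside it.

```php
<?php
// Hypothetical input: another div with attributes comes first, and the target contains a nested div.
$html = '<div id="a">skip</div><div id="b" class="myClass">outer <div>inner</div> tail</div>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select the first div whose class attribute is exactly "myClass".
$node = $xpath->query('//div[@class="myClass"]')->item(0);

// Build the inner HTML by serializing each child node in turn.
$inner = '';
foreach ($node->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $inner, "\n";
```

The nested div comes back as part of the captured content, because the parser tracks which closing tag belongs to which opening tag.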
I'm using PHP preg_match function...
How can I fetch the text in between tags? The following attempt doesn't fetch the value: preg_match("/^<title>(.*)<\/title>$/", $originalHTMLBlock, $textFound);
How can I find the first occurrence of the following element and fetch its contents (Bunch of Texts and Tags)?
<div id="post_message_">Bunch of Texts and Tags</div>
This is starting to get boring. Regex is likely not the tool of choice for matching languages like HTML, and there are thousands of similar questions on this site to prove it. I'm not going to link to the answer everyone else always links to - do a little search and see for yourself.
That said, your first regex assumes that the <title> tag is the entire input. I suspect that that's not the case. So
preg_match("#<title>(.*?)</title>#", $originalHTMLBlock, $textFound);
has a bit more of a chance of working. Note the lazy quantifier which becomes important if there is more than one <title> tag in your input. Which might be unlikely for <title> but not for <div>.
For your second question, you only have a working chance with regex if you don't have any nested <div> tags inside the one you're looking for. If that's the case, then
preg_match("#<div id=\"post_message_\">(.*?)</div>#", $originalHTMLBlock, $textFound);
might work.
But all in all, you'd better be using an HTML parser.
Use this: <title\b[^>]*>(.*?)</title> (are you sure you need ^ and $ ?)
You can use the same kind of regex, <div\b[^>]*>(.*?)</div>, assuming you don't have a </div> tag in your Bunch of Texts and Tags text. If you do, maybe you should take a look at http://code.google.com/p/phpquery/
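To see the difference the anchors make, here is a small sketch with a made-up input string. It uses the patterns from this thread, with the s flag added so that . also matches line breaks.

```php
<?php
// Made-up input: the <title> element is only one part of a larger block.
$originalHTMLBlock = "<html><head><title>My page</title></head>\n"
    . "<body><div id=\"post_message_\">Bunch of <b>Texts</b> and Tags</div></body></html>";

// Pattern from the question: ^ and $ demand that <title>...</title> is the entire input.
$anchored = preg_match('/^<title>(.*)<\/title>$/', $originalHTMLBlock, $m1);

// Unanchored and lazy: finds the element anywhere in the input.
$unanchored = preg_match('#<title>(.*?)</title>#s', $originalHTMLBlock, $m2);

// Same idea for the div; only safe while no <div> is nested inside it.
preg_match('#<div id="post_message_">(.*?)</div>#s', $originalHTMLBlock, $m3);

var_dump($anchored, $unanchored, $m2[1], $m3[1]);
```

The anchored pattern finds nothing here, while the unanchored ones capture the title text and the div contents.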
I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text and HTML was the simplest case I could think of to explain. My actual case looks something a little more like this (But a lot more data and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows in a ton of rows similar to that. Help with my first question would help my actual project.
Your regex won't work. Even if you had only non-nested paragraphs, your capturing parentheses would match First, non-nested ... Last paragraph..
Try:
<([^>]+)>([^<]*<(?!/?\1)[^<]*)*<\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.
How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).
Use a three step process. First, pray that everything is well formed. Second, remove everything that is nested.
s{<div>.*?</div>}{}gs; # HTML example (/s lets . match the newlines inside the div)
s/#.*?#//g; # 2nd example
Then get your result. Everything that is left is now not nested.
($result) = m{<p>(.*?)</p>}s; # HTML example (list context returns the captured text)
($result) = m{\[(.*?)\]}; # 2nd example
(this is Perl. Don't know how different it would look in PHP).
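In PHP, the same remove-then-match idea might look roughly like the sketch below. The sample strings are shortened versions of the ones in the question, and the variable names are arbitrary. It still relies on the input being well formed and on the nested blocks not nesting any deeper.

```php
<?php
// HTML example: remove the nested <div> blocks first, then take the first <p> that is left.
$html = "<div>\n<p>First, nested paragraph</p>\n</div>\n"
    . "<p>First, non-nested paragraph.</p>\n<p>Second paragraph.</p>";

$flat = preg_replace('~<div>.*?</div>~s', '', $html); // s flag: . also matches newlines
preg_match('~<p>(.*?)</p>~s', $flat, $m);
$firstParagraph = $m[1];

// Second example: drop everything between # markers, then read the first [...] row.
$data = '#[BILLINGCODE|12345]#[ITEM1|{{Escaped Description}}|1]'
    . '#[{{Additional Details }}]#[ITEM2|{{Escaped Description}}|3]';

$flatData = preg_replace('/#.*?#/s', '', $data);
preg_match('/\[(.*?)\]/', $flatData, $m2);
$firstRow = $m2[1];

echo $firstParagraph, "\n", $firstRow, "\n";
```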
"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode);
// output '<p>is valid</p>'
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and Regular Expressions are), you can't parse out arbitrary chunks of HTML using Regex. Use an HTML parser, it'll get the job done considerably more smoothly than trying to hack together some regex.
I'm grabbing data from a published google spreadsheet, and all I want is the information inside of the content div (<div id="content">...</div>)
I know that the content starts off as <div id="content"> and ends as </div><div id="footer">
What's the best / most efficient way to grab the part of the DOM that is inside there? I was thinking of a regular expression (see my example below), but it is not working and I'm not sure if it is that efficient...
header('Content-type: text/plain');
$foo = file_get_contents('https://docs.google.com/spreadsheet/pub?key=0Ahuij-1M3dgvdG8waTB0UWJDT3NsUEdqNVJTWXJNaFE&single=true&gid=0&output=html&ndplr=1');
$start = '<div id="content">';
$end = '<div id="footer">';
$foo = preg_replace("#$start(.*?)$end#",'$1',$foo);
echo $foo;
UPDATE
I guess another question I have is basically about if it is just simpler and easier to use regex with start and end points rather than trying to parse through a DOM which might have errors and then extract the piece I need. Seems like regex would be the way to go but would love to hear your opinions.
Try changing your regex to $foo = preg_replace("#$start(.*?)$end#s",'$1',$foo); , the s modifier changes the . to include new lines. As it is, your regex only matches if all the content between the tags is on the same line.
If your HTML page is any more complex than that, then regex probably won't cut it and you'd need to look into a parser like DOMDocument or Simple HTML DOM
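As a sketch of the marker-based approach: the page string below is a stand-in, not the real spreadsheet markup, and the end marker follows the question's description (</div><div id="footer">). Note that preg_replace only rewrites the matched part and leaves the rest of the page in the string, so preg_match is the more direct way to pull out just the content.

```php
<?php
// Stand-in for the downloaded page; the real spreadsheet markup is not reproduced here.
$foo = "<html><body><div id=\"header\">h</div><div id=\"content\">\n"
    . "<table><tr><td>42</td></tr></table>\n"
    . "</div><div id=\"footer\">f</div></body></html>";

$start = '<div id="content">';
$end   = '</div><div id="footer">';

// preg_quote() escapes regex metacharacters in the markers, including the # delimiter.
$pattern = '#' . preg_quote($start, '#') . '(.*?)' . preg_quote($end, '#') . '#s';

// Keep only the captured group; null means the markers were not found.
$content = preg_match($pattern, $foo, $m) ? $m[1] : null;

echo $content;
```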
If you have a lot to do, I would recommend you take a look at http://simplehtmldom.sourceforge.net
It's really good for this sort of thing.
Do not use regex, it can fail.
Use PHP's built-in DOM parser:
http://php.net/manual/en/class.domdocument.php
You can easily traverse and parse the relevant content.
I googled a lot, since this kind of problem has been asked about a lot in the past, but I didn't find anything to match my needs.
I have an HTML-formatted text from a form. Just like this:
Hey, I am just some kind of <strong>formatted</strong> text!
Now, I want to strip all html tags, that I don't allow. PHP's built-in strip_tags() Method does that very well.
But I want to go a step further: I want to allow some Tags only inside or not inside of other tags. I also want to define my own XML Tags.
Another example:
I am a custom xml tag: <book><strong>Hello!</strong></book>. Ok... <strong>Hi!</strong>
Now, I want the <strong/> inside of <book/> to be stripped, but the <strong>Hi!</strong> can stay the way it is.
So, I want to define some rules of what I allow or don't allow, and want to have any filter do the rest.
Is there any easy way to do that? Regexp aren't what I'm looking for, for they can't parse html properly.
Regards, Jan Oliver
Don't think there is such a thing, I think not even HTML Purifier does that.
I suggest you parse the XHTML by hand using something like Simple HTML Dom.
Use a second argument to strip_tags, which is allowable tags.
$text = strip_tags($text, '<book><myxml:tag>');
I don't think there's a way to only strip certain tags if they're not inside other tags, without using regex.
Also, it's not that regex is no good at parsing HTML, but it's slow compared to the other options. But that's not what you're doing here, anyway. You're going through the string and removing things you don't want. And for your complex requirement I think your only option is to use regex.
To be completely honest I think you should decide which tags are allowable and which aren't. Whether or not they are inside of other tags shouldn't matter at all. It's markup, not a script.
The second argument shows that you can allow some tags:
string strip_tags ( string $str [, string $allowable_tags ] )
From php.net
I wrote my own Filter class based on the DOM classes of PHP. Look here: XHTMLFilter class
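This is not the linked class, just a minimal sketch of what a DOM-based rule could look like for the <book> example in the question. It assumes the input is well-formed XML once wrapped in a root element (the <root> name is arbitrary), which real form input may not be.

```php
<?php
$text = 'I am a custom xml tag: <book><strong>Hello!</strong></book>. Ok... <strong>Hi!</strong>';

$doc = new DOMDocument();
// Wrap the fragment in a hypothetical <root> element so it parses as a single XML document.
$doc->loadXML('<root>' . $text . '</root>');
$xpath = new DOMXPath($doc);

// Rule: <strong> is not allowed anywhere inside <book>.
// Copy the matches into an array before changing the tree.
$nested = iterator_to_array($xpath->query('//book//strong'));

foreach ($nested as $strong) {
    // Move the children up in front of the <strong>, then remove the emptied element.
    while ($strong->firstChild !== null) {
        $strong->parentNode->insertBefore($strong->firstChild, $strong);
    }
    $strong->parentNode->removeChild($strong);
}

// Serialize the children of <root> to get the filtered fragment back.
$filtered = '';
foreach ($doc->documentElement->childNodes as $child) {
    $filtered .= $doc->saveXML($child);
}

echo $filtered, "\n";
```

Other rules could be expressed the same way by changing the XPath expression, for example //book//em to strip emphasis inside books.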