Why does this regex not work? - php

I want to grab any data between these two div headers, and the code below should work, is there something I am not seeing?
preg_match_all('$\<div class\=\"productDescriptionWrapper\"\>(.*?)\<div class\=\"emptyClear\"\>$', $source, $match);
Thanks in advance!

Cory, you should typically use DOMDocument for this. Parsing HTML with regex is not considered good practice: it has many hidden pitfalls and overcomplicates the task.
http://php.net/manual/en/class.domdocument.php
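A minimal sketch of the DOMDocument route, assuming the markup looks like the class names in the question and that the emptyClear div is a direct child of the wrapper. The helper name getDescriptions and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper: returns the inner HTML of every
// <div class="productDescriptionWrapper">, up to the emptyClear marker div.
function getDescriptions(string $source): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Exact class match, as in the question's markup.
    $nodes = $xpath->query('//div[@class="productDescriptionWrapper"]');

    $results = [];
    foreach ($nodes as $node) {
        $inner = '';
        foreach ($node->childNodes as $child) {
            // Assumption: the marker div is a direct child of the wrapper.
            if ($child instanceof DOMElement && $child->getAttribute('class') === 'emptyClear') {
                break;
            }
            $inner .= $doc->saveHTML($child);
        }
        $results[] = trim($inner);
    }
    return $results;
}

// Sample input invented for this example.
$source = '<html><body><div class="productDescriptionWrapper"><p>A fine product.</p>'
        . '<div class="emptyClear"></div></div></body></html>';
print_r(getDescriptions($source));
```

Unlike the regex, this does not depend on attribute order, quoting style, or line breaks inside the description.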

Related

Get content between code tag return in array

I want to get the content between a code tag in a html document.
I tried forming it in preg_match...
Could anybody help me..
If you want to use preg_match, do:
preg_match("/<code>(.+?)<\/code>/is", $content, $matches);
Then access it with
$matches[1]
In general, though, you will get more mileage out of an HTML parser, which is the preferred tool for this over regular expressions.
It's easier if you use phpQuery or QueryPath which allow:
print qp($html)->find("code")->text();
// looks for the <code> tag and prints the text content
If you want to try regular expressions for this, check out some of the tools listed in https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world for help.

PHP regex use same letter

I'm trying to write a regex that finds all HTML tags, but for each match the opening and closing tag must be the same. Here's what I mean (yes, I only want a maximum of 3 letters):
preg_match_all("/\<[a-z]{1,3}\>(.*?)\<\/[a-z]{1,3}\>/", $string, $matches);
I want the two [a-z]{1,3} parts to be the same, so that it doesn't match <b> with </i>, etc. Thanks... let me know if you need further explanation.
Don't parse HTML with regex. Use PHP Tidy instead.
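You really shouldn't be parsing *ML with regex because of problems with nested elements, but if this is any help (note the single quotes: in a double-quoted PHP string, \1 would be read as an octal escape rather than a backreference):
preg_match_all('/<([a-z]{1,3})>(.*?)<\/\1>/', $string, $matches);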
As Vivin Paliath said. You can also try PHP 5's DOMDocument with XPath:
http://php.net/manual/en/class.domdocument.php
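A small sketch of the backreference approach from the answers above, with the nesting caveats still applying. The helper name matchedTags and the sample string are invented for this example.

```php
<?php
// Hypothetical helper: returns [tag, content] pairs for short tags whose
// opening and closing names are identical.
function matchedTags(string $string): array
{
    // Single quotes keep \1 intact as a backreference to the first group;
    // # as the delimiter avoids escaping the / in the closing tag.
    preg_match_all('#<([a-z]{1,3})>(.*?)</\1>#', $string, $matches, PREG_SET_ORDER);

    $pairs = [];
    foreach ($matches as $m) {
        $pairs[] = [$m[1], $m[2]];
    }
    return $pairs;
}

// Sample input invented for this example: <i>...</b> is mismatched and is skipped.
print_r(matchedTags('<b>bold</b> <i>oops</b> <em>fine</em>'));
```

This still breaks on nested elements of the same name, which is the main reason the answers recommend a parser.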

PHP: Filter specific html tags out of a given text

I googled a lot, since this kind of problem has been asked about many times in the past, but I didn't find anything that matches my needs.
I have a html formatted text from a form. Just like this:
Hey, I am just some kind of <strong>formatted</strong> text!
Now, I want to strip all HTML tags that I don't allow. PHP's built-in strip_tags() function does that very well.
But I want to go a step further: I want to allow some tags only inside, or only outside, of other tags. I also want to define my own XML tags.
Another example:
I am a custom xml tag: <book><strong>Hello!</strong></book>. Ok... <strong>Hi!</strong>
Now, I want the <strong/> inside of <book/> to be stripped, but the <strong>Hi!</strong> can stay the way it is.
So, I want to define some rules of what I allow or don't allow, and want to have any filter do the rest.
Is there any easy way to do that? Regexp aren't what I'm looking for, for they can't parse html properly.
Regards, Jan Oliver
Don't think there is such a thing, I think not even HTML Purifier does that.
I suggest you parse the XHTML by hand using something like Simple HTML Dom.
Use a second argument to strip_tags, which is allowable tags.
$text = strip_tags($text, '<book><myxml:tag>');
I don't think there's a way to only strip certain tags if they're not inside other tags, without using regex.
Also, it's not that regex can't handle HTML, but it's slow compared to the alternatives. That's not what you're doing here anyway: you're going through the string and removing things you don't want. For your complex requirement, I think regex is your only option.
To be completely honest I think you should decide which tags are allowable and which aren't. Whether or not they are inside of other tags shouldn't matter at all. It's markup, not a script.
The second argument shows that you can allow some tags:
string strip_tags ( string $str [, string $allowable_tags ] )
From php.net
I wrote my own Filter class based on the DOM classes of PHP. Look here: XHTMLFilter class
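The XHTMLFilter class itself is not shown here, so the following is only a rough sketch of what a DOM-based rule such as "no <strong> inside <book>" could look like. The helper name unwrapInside and the fragment-root wrapper id are invented, the tag names are assumed to be trusted lowercase names, and the input is assumed to be plain ASCII.

```php
<?php
// Hypothetical helper: unwraps every <$inner> element nested inside an
// <$outer> element, keeping its text and child nodes in place.
function unwrapInside(string $html, string $outer, string $inner): string
{
    $doc = new DOMDocument();
    // Custom tags such as <book> make the HTML parser emit warnings; silence them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so it has a single root we can find again afterwards.
    $doc->loadHTML('<div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the result into an array before modifying the tree.
    $targets = iterator_to_array($xpath->query("//{$outer}//{$inner}"));
    foreach ($targets as $node) {
        // Move the children up, then drop the now-empty element.
        while ($node->firstChild) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        $node->parentNode->removeChild($node);
    }

    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$text = 'I am a custom xml tag: <book><strong>Hello!</strong></book>. Ok... <strong>Hi!</strong>';
echo unwrapInside($text, 'book', 'strong');
```

Running strip_tags() with an allowed-tags list first, then applying context rules like this one, keeps each rule small.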

question regarding php function preg_replace

I want to dynamically remove specific tags and their content from an html file and thought of using preg_replace but can't get the syntax right. Basically it should, for example, do something like :
Replace everything between (and including) "" by nothing.
Could anybody help me out on this please ?
Easy dude.
To make a regex ungreedy, use the U modifier.
To let . match newlines as well, use the s modifier.
Knowing that, to remove all paragraphs use this pattern:
#<p[^>]*>(.*)?</p>#sU
Explanation:
I use # as the delimiter so that I don't have to escape the / characters (which keeps the pattern more readable)
<p[^>]*> : matches an opening paragraph tag, with optional attributes such as a style
(.*)? : everything (in "ungreedy" mode)
</p> : obviously, the closing paragraph tag
Hope that helps!
If you are trying to sanitize your data, it is often recommended that you use a whitelist as opposed to blacklisting certain terms and tags. A whitelist makes it easier to sanitize input and prevent XSS attacks. There's a well-known library called HTML Purifier that, although large and somewhat slow, gives very good results when purifying your data.
I would suggest not trying to do this with a regular expression. A safer approach would be to use something like
Simple HTML DOM
Here is the link to the API Reference: Simple HTML DOM API Reference
Another option would be to use DOMDocument
The idea here is to use a real HTML parser to parse the data and then you can move/traverse through the tree and remove whichever elements/attributes/text you need to. This is a much cleaner approach than trying to use a regular expression to replace data within the HTML.
<?php
$doc = new DOMDocument;
$doc->loadHTMLFile('blah.html');
// Grab the first <table> anywhere in the document
$table = $doc->getElementsByTagName('table')->item(0);
if ($table !== null) {
    // removeChild() must be called on the node's direct parent
    $table->parentNode->removeChild($table);
}
echo $doc->saveHTML();
?>
If you don't know what is between the tags, Phill's response won't work.
This will work if there's no other tags in between, and is definitely the easier case. You can replace the div with whatever tag you need, obviously.
preg_replace('#<div>[^<]+</div>#','',$html);
If there could be other tags in the middle, this should work, but could cause problems. You're probably better off going with the DOM solution above, if so
preg_replace('#<div>.+</div>#','',$html);
These aren't tested
PSEUDO CODE
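function replaceMe($html_you_want_to_replace, $html_dom) {
    // preg_quote() escapes regex metacharacters so the snippet is matched literally
    $pattern = '#^' . preg_quote($html_you_want_to_replace, '#') . '#';
    return preg_replace($pattern, '', $html_dom);
}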
HTML Before
<div>I'm Here</div><div>I'm next</div>
<?php
$html_dom = "<div>I'm Here</div><div>I'm next</div>";
$get_rid_of = "<div>I'm Here</div>";
echo replaceMe($get_rid_of, $html_dom);
?>
HTML After
<div>I'm next</div>
I know it's a hack job

Regex to match html attributes

I am trying to match a pattern so that I can retrieve a string from a website. Here is the string in Question:
<a title="Posts by ivek dhwWaVa"
href="http://www.example.com/author/ivek/"
rel="nofollow">ivek</a>
I am trying to match the string "ivek" in between the a tag and I want to do this for each post and relate it to the number of comments.
Firstly, what regex should I use for the above, so I can use it as an example for the rest? I have nothing so far:
$content = file_get_contents('http://www.example.com');
preg_match_all("", $content, $matches);
And how would I relate the comments to the author's name, given that there are many other authors on the website, each with their own set of comments? Do I use divs to break this up? Each set of info is wrapped in this div:
<div id="post-54" class="excerpt">
Thanks all for any help!
Please let me be the first to introduce you to the most famous answer on Stack Overflow.
Regular expressions are not suited to parsing HTML. You really need an HTML parser, even for what might appear to be a simple task.
I recommend something like PHP Simple HTML DOM Parser.
You really shouldn't be looking to Regex to do the job:
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
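For the question's markup, a parser-based version could look roughly like this. It uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, and the helper name authorsByPost is invented. The markup for the comment count is not shown in the question, so only the author lookup is sketched.

```php
<?php
// Hypothetical helper: maps each post's id to the author name found inside it.
function authorsByPost(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $authors = [];
    // One iteration per post wrapper, so author and post stay associated.
    foreach ($xpath->query('//div[@class="excerpt"]') as $post) {
        // The second argument scopes the query to this post's subtree.
        $link = $xpath->query('.//a[starts-with(@title, "Posts by")]', $post)->item(0);
        if ($link !== null) {
            $authors[$post->getAttribute('id')] = trim($link->textContent);
        }
    }
    return $authors;
}

// Sample assembled from the snippets in the question.
$html = '<div id="post-54" class="excerpt"><a title="Posts by ivek dhwWaVa" '
      . 'href="http://www.example.com/author/ivek/" rel="nofollow">ivek</a></div>';
print_r(authorsByPost($html));
```

A second scoped query inside the same loop could pick up the comment count once its markup is known.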
