PHP regex use same letter - php

I'm trying to write a regex that finds all HTML tags, but for each match the opening and closing tag must be the same. Here's what I mean (yes, I only want a maximum of 3 letters):
preg_match_all("/\<[a-z]{1,3}\>(.*?)\<\/[a-z]{1,3}\>/", $string, $matches);
I want the two [a-z]{1,3} parts to be the same, so it doesn't match <b> with </i>, etc. Thanks. Let me know if you need further explanation.

Don't parse HTML with regex. Use PHP Tidy instead.

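You really shouldn't be parsing *ML with regex because of problems with nested elements, but if this is any help (note the single quotes: inside a double-quoted PHP string, \1 is read as an octal character escape rather than a backreference):
preg_match_all('/<([a-z]{1,3})>(.*?)<\/\1>/', $string, $matches);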
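A minimal sketch of the backreference approach, assuming a made-up sample string (the real $string would come from the question's page). It is written with single quotes so that \1 reaches the regex engine as a backreference, and it uses PREG_SET_ORDER so that each match arrives as one array:

```php
<?php
// Hypothetical sample input; the last pair is deliberately mismatched.
$string = '<b>bold</b> <i>italic</i> <b>mismatched</i>';

// Group 1 captures the tag name; \1 requires the closing tag to repeat it.
preg_match_all('/<([a-z]{1,3})>(.*?)<\/\1>/', $string, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    // $m[1] is the tag name, $m[2] is the text between the tags.
    $pairs[] = [$m[1], $m[2]];
}
print_r($pairs);
```

Only the <b>...</b> and <i>...</i> pairs are reported; the <b>...</i> pair is skipped because the closing tag does not repeat the captured name.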

As Vivin Paliath said. You can also try PHP 5's DOMDocument with XPath:
http://php.net/manual/en/class.domdocument.php
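A rough sketch of what the DOMDocument plus XPath route could look like. The sample markup and the choice of <b> and <i> as the tags of interest are assumptions for illustration; the parser takes care of pairing opening and closing tags, nested ones included:

```php
<?php
// Hypothetical sample markup standing in for the question's $string.
$string = '<p>Some <b>bold</b> and <i>italic</i> text, plus <b>more <i>nested</i></b>.</p>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($string);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$found = [];
// Select every <b> or <i> element anywhere in the document.
foreach ($xpath->query('//*[self::b or self::i]') as $node) {
    $found[] = [$node->nodeName, $node->textContent];
}
print_r($found);
```

Unlike the regex, this also reports the <i> nested inside the second <b>.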

Related

[php]how to extract a single simple text from a long html source

I have some HTML like this:
......whatever very long html.....
<span class="title">hello world!</span>
......whatever very long html......
It is a very long document, and I only want the content 'hello world!' from it.
I got the HTML with:
$result = file_get_contents($url , false, $context);
Many people use the Simple HTML DOM parser, but I think regex would be more efficient in this case.
How should I do it? Any suggestions would be a great help.
Thanks in advance!
Stick with the DOM parser - it is better. Having said that, you could use a REGEX like this...
// where the html is stored in `$html`
preg_match('/<span class="title">(.+?)<\/span>/', $html, $m);
$whatYouWant = $m[1];
preg_match() fills $m with the entire matched string at index 0, followed by one element per parenthesized capture group in the regex. The regex is very simple in this case, being almost a direct string match for what you want, with the closing span tag's slash escaped. The captured part means any character except a newline (.), one or more times (+), non-greedily (?).
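To make the layout of the match array concrete, here is a small sketch with a made-up $html value. It also checks the return value of preg_match() before reading $m[1], since that index is not set when nothing matches:

```php
<?php
// Hypothetical sample; the real $html would be the fetched page.
$html = 'before <span class="title">hello world!</span> after';

if (preg_match('/<span class="title">(.+?)<\/span>/', $html, $m)) {
    // $m[0] is the whole match, tags included; $m[1] is the first capture group.
    var_dump($m[0], $m[1]);
} else {
    // preg_match() returned 0 (no match) or false (error).
    echo "no title found\n";
}
```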
No, I really don't think regEx or similar functions would be either more effective or easier.
If you would use SimpleHTML DOM, you could quickly get the data you are looking for like this:
//Get your file
$html = file_get_html('myfile.html');
//Use jQuery style selectors
// find() returns an array unless an index is given; 0 selects the first match
$spanValue = $html->find('span.title', 0)->plaintext;
echo($spanValue);
With preg_match you could do it like this (the [^`] class matches any character except a backtick, so the capture can span newlines):
preg_match("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);
or this, if there are multiple spans with the class "title":
preg_match_all("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);
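For comparison, the same extraction can be sketched with PHP's built-in DOM extension, which needs no third-party library. The $html value below is a made-up stand-in for the result of file_get_contents(), and the XPath expression assumes the class attribute is exactly "title":

```php
<?php
// Hypothetical stand-in for the page fetched with file_get_contents().
$html = '<html><body><div>intro</div><span class="title">hello world!</span><p>rest</p></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Exact attribute match; a span carrying several classes would need contains().
$nodes = $xpath->query('//span[@class="title"]');
$title = $nodes->length > 0 ? $nodes->item(0)->textContent : null;
echo $title, "\n";
```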

Regular expression syntax problem

$pattern='`<a\s+[^>]*(href=([\'\"]).*\\2)[^>]*>([^<]*)</a>`isU';
I want to change ([^<]*) so that it stops at </a> rather than at any <, because an <img> tag could be inside the <a> tag.
Can anyone help? I'm lousy at regex.
You can use a PHP parser to do this. I wouldn't use Regex at all.
You can try:
http://simplehtmldom.sourceforge.net/
Although I think PHP has a DOM parser built in.
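Changing ([^<]*) to a non-greedy match-all, (.*?), might do the trick. Note that the pattern already uses the U modifier, which inverts greediness, so under U the non-greedy form is written (.*).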
([^<]*) could be changed to ((?:[^<]|<(?!/a>))*), which uses a negative lookahead to match non-< characters or < characters which are not followed by /a>.
HOWEVER, as stated many times over already, this is not a good way to parse HTML. Firstly, it's horribly inefficient, and secondly, what happens if you have nested tags, such as <a><a></a></a>? While this may not happen with hyperlinks, it's common among many other HTML elements.
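A hedged sketch of the negative-lookahead variant on a made-up input. The pattern here is a simplified version of the one in the question (no U modifier, and the href value gets its own capture group), so treat it as an illustration rather than a drop-in replacement:

```php
<?php
// Hypothetical sample: the first link contains an <img>, the second is plain text.
$html = '<a href="/home"><img src="logo.png" alt="Logo"> Home</a> <a href=\'/about\'>About</a>';

// Group 1: the quote character, group 2: the href value,
// group 3: everything up to (but not including) the next </a>.
$pattern = '`<a\s+[^>]*href=([\'"])(.*?)\1[^>]*>((?:[^<]|<(?!/a>))*)</a>`is';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[] = [$m[2], $m[3]];
}
print_r($links);
```

The first link's content keeps the <img> tag, because a < is only rejected when it begins </a>.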

How can I grab this with RegEx?

Say I have this:
<li class="one"><strong>String here: </strong><span class="one">
<!--googleoff: all-->
<strong>STRING TO GRAB</strong>
<!--googleon: all-->
</span></li>
How can I grab the STRING TO GRAB efficiently with RegEx? Keep in mind that this isn't the only text on the page, so /<strong>(.*)<\/strong>/ wouldn't work.
Thanks
There are two ways.
DOM classes: use PHP's DOM classes if the HTML is reasonably well formed.
See:
- http://www.php.net/manual/en/domxpath.query.php
- http://www.php.net/manual/en/domdocument.loadhtml.php
Regex
If the HTML is not really valid, or DOM loading does not work, regex may be a workable solution.
I'm assuming that the <!--googleoff: all--> comment is always present; in that case this might work. If not, perhaps you can give more detail on what identifies the string:
$string = "yourhtmlstring";
$matches = array();
preg_match('/<!--googleoff: all-->\s+?<strong>(.+)<\/strong>\s+?<!--googleon: all-->/', $string, $matches);
var_dump($matches);
Final tip
To test the regex further: http://tinyurl.com/6gy6584
As said in the other answer, regex isn't the best tool for HTML (or XML), but you could use:
/<strong>(.+?)<\/strong>/
Note the ?, which makes the quantifier non-greedy.
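The two regex suggestions behave differently on the question's snippet, which a short sketch can show. The markup is copied from the question; the variable names are illustrative. The non-greedy pattern matches every <strong> pair in turn, while the comment-anchored pattern picks out only the one between the googleoff and googleon markers:

```php
<?php
// The snippet from the question, kept verbatim in a nowdoc.
$string = <<<'HTML'
<li class="one"><strong>String here: </strong><span class="one">
<!--googleoff: all-->
<strong>STRING TO GRAB</strong>
<!--googleon: all-->
</span></li>
HTML;

// Non-greedy: collects the content of each <strong> element separately.
preg_match_all('/<strong>(.+?)<\/strong>/', $string, $all);

// Anchored on the surrounding comments: matches only the wanted element.
preg_match('/<!--googleoff: all-->\s+?<strong>(.+)<\/strong>\s+?<!--googleon: all-->/', $string, $anchored);

print_r($all[1]);
echo $anchored[1], "\n";
```

With preg_match() alone, the non-greedy pattern would return "String here: ", the first <strong> on the page, so some extra anchor is still needed to reach the target.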

Get content between code tag return in array

I want to get the content between the code tags in an HTML document.
I tried building it with preg_match...
Could anybody help me?
If you want to use preg_match, do:
preg_match("/<code>(.+?)<\/code>/is", $content, $matches);
Then access it with
$matches[1]
In general, though, an HTML parser is more reliable than regular expressions for this and is the preferred approach.
It's easier if you use phpQuery or QueryPath which allow:
print qp($html)->find("code")->text();
// looks for the <code> tag and prints the text content
If you want to try regular expressions for this, check out some of the tools listed in https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world for help.
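If installing phpQuery or QueryPath is not an option, something similar can be sketched with the built-in DOM extension. The $content value is a made-up sample, and the result is collected into an array since the question asks for one:

```php
<?php
// Hypothetical sample document with two <code> elements.
$content = '<p>Run <code>composer install</code> first.</p><pre><code>php -v</code></pre>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();

$snippets = [];
// getElementsByTagName() walks the document in order.
foreach ($doc->getElementsByTagName('code') as $node) {
    $snippets[] = $node->textContent;
}
print_r($snippets);
```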

PHP,preg_match,Regular Expression. What am I doing wrong?

Here is the pattern that I want to match:
<div class="class">
I want to be able to capture this text
<span class="ptBrand">
This is what I am doing:
$pattern='{<div class="productTitle">[\n]<((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:##%/;$()~_?\+-=\\\.&]*)>([^\n]*)</a>[\n]<span class="ptBrand">}';
preg_match($pattern, $data, $matches,PREG_OFFSET_CAPTURE);
print_r($matches);
It prints:
Array ( )
As a general rule, regular expressions are a really poor means of parsing HTML. They're unreliable and tend to end up being really complicated. A far more robust solution is to use an HTML parser. See Parse HTML With PHP And DOM.
As for your expression, I don't see <div class="productTitle" anywhere in the source so I'd start there. Likewise you're trying to parse a URL but there's no mention of the anchor tag (either directly or through a sufficient wildcard) so it'll fail there too. Basically that expression doesn't look anything like the HTML you're trying to parse.
... Or this:
preg_match('/\s*([^>]+)\s*<\/a/',$string,$match);
Trims it too.
The pattern:
/<div class="class">\s*([^<]+)/m
Would get the link and text roughly, but using the DOM library would be a much better method.
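As a quick check of the simpler pattern, here is a sketch run against markup shaped like the sample in the question. The closing tags and the "Brand" text are assumptions added so that the sample is complete. The capture includes the trailing newline, so trim() is applied afterwards:

```php
<?php
// Hypothetical markup modelled on the question's sample.
$data = "<div class=\"class\">\nI want to be able to capture this text\n<span class=\"ptBrand\">Brand</span></div>";

$text = null;
if (preg_match('/<div class="class">\s*([^<]+)/', $data, $match)) {
    // [^<]+ runs up to the next tag, so strip the surrounding whitespace.
    $text = trim($match[1]);
}
var_dump($text);
```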
You can try this:
([\s\S]*?)
