Say I have this:
<li class="one"><strong>String here: </strong><span class="one">
<!--googleoff: all-->
<strong>STRING TO GRAB</strong>
<!--googleon: all-->
</span></li>
How can I grab the STRING TO GRAB efficiently with RegEx? Keep in mind that this isn't the only text on the page, so /<strong>(.*)<\/strong>/ wouldn't work.
Thanks
There are two ways.
DOM classes: use PHP's DOM classes if the HTML is reasonably well formed.
See:
- http://www.php.net/manual/en/domxpath.query.php
- http://www.php.net/manual/en/domdocument.loadhtml.php
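For what it's worth, here is a minimal sketch of the DOM route applied to the snippet from the question. The XPath expression and the variable names are my own assumptions about what you want to target; adjust them to the real page.

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<li class="one"><strong>String here: </strong><span class="one">
<!--googleoff: all-->
<strong>STRING TO GRAB</strong>
<!--googleon: all-->
</span></li>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Assumption: the wanted <strong> is a direct child of <span class="one">.
$nodes = $xpath->query('//span[@class="one"]/strong');

$grabbed = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
var_dump($grabbed);
```

The first `<strong>` ("String here: ") is skipped because it is a child of the `<li>`, not of the `<span>`.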
Regex
If it's not really valid HTML, or DOM loading does not work, regex may be a good solution.
I'm assuming that the <!--googleoff: all--> comment is always present. If so, this might work; if not, perhaps you can say more about what makes the string specific:
$string = "yourhtmlstring";
$matches = array();
preg_match('/<!--googleoff: all-->\s+?<strong>(.+)<\/strong>\s+?<!--googleon: all-->/', $string, $matches);
var_dump($matches);
Final tip
To test the regex further: http://tinyurl.com/6gy6584
As said in the other answer, regex isn't the best tool for HTML (or XML).
/<strong>(.+?)<\/strong>/
Note the ? which makes the regex non greedy
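To make the greedy/non-greedy difference concrete, here is a small sketch on a one-line version of the question's markup (the sample string is my own simplification):

```php
<?php
$html = '<strong>String here: </strong><span><strong>STRING TO GRAB</strong></span>';

// Greedy: .+ runs on to the LAST </strong> in the string.
preg_match('/<strong>(.+)<\/strong>/', $html, $greedy);

// Lazy: .+? stops at the FIRST </strong> it can reach.
preg_match('/<strong>(.+?)<\/strong>/', $html, $lazy);

var_dump($greedy[1], $lazy[1]);
```

Note that the lazy version still returns the first `<strong>` on the page, so on its own it does not pick out the second one; for that you need some surrounding context in the pattern.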
Related
I have HTML like this:
......whatever very long html.....
<span class="title">hello world!</span>
......whatever very long html......
It is a very long HTML document and I only want the content 'hello world!' from it.
I got this HTML with:
$result = file_get_contents($url , false, $context);
Many people use the Simple HTML DOM parser, but I think in this case a regex would be more efficient.
How should I do it? Any suggestions or help would be great.
Thanks in advance!
Stick with the DOM parser - it is better. Having said that, you could use a REGEX like this...
// where the html is stored in `$html`
preg_match('/<span class="title">(.+?)<\/span>/', $html, $m);
$whatYouWant = $m[1];
preg_match() stores an array of all the elements captured inside brackets in the regex, and a 0th element which is the entire captured string. The regex is very simple in this case, being almost a direct string match for what you want, with the closing span tag's slash escaped. The captured part just means any character (.) one or more times (+) un-greedily (?).
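A short sketch of what ends up in the match array, with a check on the return value so a page without the span does not trigger an undefined index notice. The sample `$html` is made up for illustration.

```php
<?php
$html = '<div><span class="title">hello world!</span></div>';

if (preg_match('/<span class="title">(.+?)<\/span>/', $html, $m) === 1) {
    $whole = $m[0];   // 0th element: the entire matched string
    $title = $m[1];   // 1st element: what the first pair of brackets captured
} else {
    $whole = $title = null;   // pattern not found
}

var_dump($whole, $title);
```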
No, I really don't think regEx or similar functions would be either more effective or easier.
If you would use SimpleHTML DOM, you could quickly get the data you are looking for like this:
//Get your file
$html = file_get_html('myfile.html');
//Use jQuery style selectors
$spanValue = $html->find('span.title', 0)->plaintext; // index 0 = first match; without it find() returns an array
echo($spanValue);
With preg_match you could do it like this:
preg_match("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);
or this, if there are multiple spans with the class "title":
preg_match_all("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);
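In case the shape of the result is unclear, here is a rough sketch of preg_match_all on made-up input. I have used `.*?` with the `s` flag instead of the backtick trick; that is my substitution, not the pattern above.

```php
<?php
$data = '<span class="title">first</span><p>x</p><span class="title">second</span>';

// Returns the number of full matches; $matches is grouped by capture group.
$count = preg_match_all('/<span class="title">(.*?)<\/span>/s', $data, $matches);

// $matches[0] holds the full matches, $matches[1] the contents of group 1.
var_dump($count, $matches[1]);
```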
I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text, and HTML was the simplest case I could think of to explain. My actual case looks a little more like this (but with a lot more data, and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows across a ton of rows similar to that. An answer to my first question would help with the actual project.
Your regex won't work. Because .+? is lazy, it stops at the first </p> it reaches, so it captures the nested paragraph; a greedy .+ would instead run from First, nested ... all the way to Last paragraph.
Try:
<([^>]+)>((?:[^<]|<(?!/?\1>))*)</\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.
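Here is how the backreference idea might look in PHP. This is only a sketch: the pattern is one workable variant of it, it assumes tags without attributes, and it does not cope with an element nested inside another element of the same name.

```php
<?php
$html = <<<'HTML'
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
HTML;

// Group 1: tag name. Group 2: everything up to the matching close tag,
// where a "<" is only consumed if it does not start <name> or </name>.
$pattern = '~<([^>]+)>((?:[^<]|<(?!/?\1>))*)</\1>~';
preg_match_all($pattern, $html, $sets, PREG_SET_ORDER);

// Top-level elements come back in document order; take the first <p>.
$firstParagraph = null;
foreach ($sets as $set) {
    if ($set[1] === 'p') {
        $firstParagraph = $set[2];
        break;
    }
}
var_dump($firstParagraph);
```

The `<div>` is matched as one unit, so the paragraph inside it never becomes a match of its own.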
How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).
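A quick sketch of that look-ahead in PHP. It assumes minified input, meaning no whitespace between a closing tag and the next tag, which matches what the question describes; the sample string is made up.

```php
<?php
$html = '<div><p>First, nested paragraph</p></div>'
      . '<p>First, non-nested paragraph.</p><p>Second paragraph.</p>';

// After </p> we require either an opening tag ("<" not followed by "/")
// or the end of the string; "</div>" right after </p> is rejected.
$found = preg_match('~<p>([^<>]+)</p>(?=<[^/]|$)~', $html, $m);

$paragraph = $found === 1 ? $m[1] : null;
var_dump($paragraph);
```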
Use a three-step process. First, pray that everything is well formed. Second, remove everything that is nested.
s{<div>.*?</div>}{}gs; # HTML example (/s lets . match newlines inside the div)
s/#.*?#//gs; # 2nd example
Then get your result. Everything that is left is now not nested.
($result) = m{<p>(.*?)</p>}s; # HTML example (list context returns the capture)
($result) = m{\[(.*?)\]}; # 2nd example
(this is Perl. Don't know how different it would look in PHP).
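For reference, a rough PHP equivalent of the same strip-then-match steps. The sample strings are shortened versions of the ones in the question, and the variable names are mine.

```php
<?php
$html = "<div>\n<p>First, nested paragraph</p>\n</div>\n"
      . "<p>First, non-nested paragraph.</p>\n<p>Second paragraph.</p>";
$data = "#[BILLINGCODE|12345]#\n[ITEM1|{{Escaped Description}}|1]\n"
      . "#[{{Additional Details }}]#\n[ITEM2|{{Escaped Description}}|3]";

// Step 2: remove everything that is nested (s flag: "." also matches newlines).
$flatHtml = preg_replace('~<div>.*?</div>~s', '', $html);
$flatData = preg_replace('/#.*?#/s', '', $data);

// Step 3: whatever is left is not nested, so a plain lazy match is enough.
$htmlResult = preg_match('~<p>(.*?)</p>~s', $flatHtml, $m1) === 1 ? $m1[1] : null;
$dataResult = preg_match('/\[(.*?)\]/', $flatData, $m2) === 1 ? $m2[1] : null;

var_dump($htmlResult, $dataResult);
```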
"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode);
// output '<p>is valid</p>'
You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and Regular Expressions are), you can't parse out arbitrary chunks of HTML using Regex. Use an HTML parser, it'll get the job done considerably more smoothly than trying to hack together some regex.
I'm trying to write a regex that finds all HTML tags, but for each one the opening and closing tag must be the same. Here's what I mean (yes, I only want a maximum of 3 letters):
preg_match_all("/\<[a-z]{1,3}\>(.*?)\<\/[a-z]{1,3}\>/", $string, $matches);
Where the two [a-z]{1,3} are, I want those to be the same, so it doesn't match <b> with </i>, etc. Thanks... let me know if you need further explanation.
Don't parse HTML with regex. Use PHP Tidy instead.
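You really shouldn't be parsing *ml with regex because of problems with nested elements, but if this is any help (note the single quotes: in a double-quoted PHP string "\1" is an octal escape, not a backreference):
preg_match_all('/<([a-z]{1,3})>(.*?)<\/\1>/', $string, $matches);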
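A small sketch of the backreference in action on a made-up string. One PHP detail worth knowing: inside double quotes `"\1"` is read as an octal escape (the byte 0x01), so the pattern string here is single-quoted.

```php
<?php
$string = '<b>bold</b> <i>italic</b> <em>emphasis</em>';

// \1 must repeat whatever group 1 matched, so <i>...</b> is not a pair.
$count = preg_match_all('/<([a-z]{1,3})>(.*?)<\/\1>/', $string, $matches);

var_dump($count, $matches[1], $matches[2]);

// The double-quote pitfall: this is a control character, not "\1".
var_dump("\1" === chr(1));
```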
As Vivin Paliath said plus you can try to use PHP5's DomDocument with XPath
http://php.net/manual/en/class.domdocument.php
Here is the pattern that I want to match:
<div class="class">
I want to be able to capture this text
<span class="ptBrand">
This is what I am doing:
$pattern='{<div class="productTitle">[\n]<((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:##%/;$()~_?\+-=\\\.&]*)>([^\n]*)</a>[\n]<span class="ptBrand">}';
preg_match($pattern, $data, $matches,PREG_OFFSET_CAPTURE);
print_r($matches);
It prints:
Array ( )
As a general rule, regular expressions are a really poor means of parsing HTML. They're unreliable and tend to end up being really complicated. A far more robust solution is to use an HTML parser. See Parse HTML With PHP And DOM.
As for your expression, I don't see <div class="productTitle" anywhere in the source so I'd start there. Likewise you're trying to parse a URL but there's no mention of the anchor tag (either directly or through a sufficient wildcard) so it'll fail there too. Basically that expression doesn't look anything like the HTML you're trying to parse.
... Or this:
preg_match('/\s*([^>]+?)\s*<\/a/',$string,$match);
Trims it too.
The pattern:
/<div class="class">\s*([^<]+)/m
Would get the link and text roughly, but using the DOM library would be a much better method.
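As a sketch, here is that pattern run against the snippet from the question, with a trim() added to drop the trailing newline that `[^<]+` picks up (the trim is my addition):

```php
<?php
$data = <<<'HTML'
<div class="class">
I want to be able to capture this text
<span class="ptBrand">
HTML;

// \s* skips the newline after the div; [^<]+ runs up to the next tag.
$found = preg_match('/<div class="class">\s*([^<]+)/', $data, $m);

$text = $found === 1 ? trim($m[1]) : null;
var_dump($text);
```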
You can try this:
([\s\S]*?)
I would like such empty span tags (filled with &nbsp; and spaces) to be removed:
<span>&nbsp; </span>
I've tried with this regex, but it needs adjusting:
(<span>(&nbsp;|\s)*</span>)
preg_replace('#<span>(&nbsp;|\s)*</span>#si','<\\1>',$encoded);
Translating Kent Fredric's regexp to PHP:
preg_match_all('#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#im', $html, $result);
This will match:
- self-closing spans
- spans spread over multiple lines, whatever the case
- spans with attributes
- spans containing non-breaking spaces
Maybe you should think about including spans containing only <br /> as well...
As usual, when it comes to tweaking a regexp, some tools are handy:
http://regex.larsolavtorvik.com/
qr{<span[^>]*(/>|>\s*?</span>)}
Should get the gist of them (including XML-style self-closing tags, i.e. <span />).
But you really shouldn't use regex for HTML processing.
Answer only relevant to the context of the question that was visible before the formatting errors were corrected
I suppose these spans are generated by some program, since they don't seem to have any attributes.
I am perplexed as to why you need to put the space they enclose between angle brackets, but then again I don't know the final purpose of the code.
I think the solution is given by Kent: you have to make the match non-greedy. Since you use the dotall option (s), you would otherwise match everything between the first span and the last closing span!
So the answer should look like:
preg_replace('#<span>(&nbsp;|\s)*?</span>#si', '<$1>', $encoded);
(untested)
I've tried with this regex, but it needs adjusting:
In what way does the regex in the original question fail?
The problem comes when the span gets
nested like: <span><span> </span></span>
This is an example of why using regexes to parse HTML doesn't work particularly well. Depending on your regex flavor, this situation is either impossible to handle in a single pass or merely very difficult. I don't know PHP's regex engine well enough to say which category it falls into, but, if the only problem is that it takes out the inner <span> and leaves the outer one alone, then you may want to consider simply re-running your substitution repeatedly until it runs out of things to do.
If your only issue are nested span tags, you can run the search-and-replace with the regex you have in a loop until the regex no longer finds any matches.
This may not be a very elegant solution, but it'll perform well enough.
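The loop could be sketched like this. The function name is hypothetical and the pattern is just one reasonable guess at "empty span"; the point is the `$count` argument of preg_replace, which tells you when there is nothing left to remove.

```php
<?php
// Hypothetical helper: strip empty <span> elements, innermost first,
// repeating until a pass removes nothing.
function strip_nested_empty_spans(string $html): string
{
    $pattern = '#<span\b[^>]*>(?:\s|&nbsp;)*</span>#i';
    do {
        $next = preg_replace($pattern, '', $html, -1, $count);
        if ($next === null) {
            break;              // regex engine error: keep the last good value
        }
        $html = $next;
    } while ($count > 0);

    return $html;
}

var_dump(strip_nested_empty_spans('<p>a<span><span>&nbsp; </span></span>b</p>'));
```

Each pass removes the innermost empty spans, which turns their parents into empty spans for the next pass.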
Here is my solution to nesting tags problems, still not complete but close...
$test = "<span> <span>&nbsp; </span> test <span>&nbsp; <span>&nbsp; </span> </span> &nbsp;&nbsp; </span>";
$pattern = '#<(\w+)[^>]*>(&nbsp;|\s)*</\1>#im';
while (preg_match($pattern, $test, $matches, PREG_OFFSET_CAPTURE) != 0) {
    $test = preg_replace($pattern, '', $test);
}
For short $test sentences the function works OK. Problem comes when trying with a long text. Any help will be appreciated...
Modifying e-satis' answer a bit:
function remove_empty_spans($html_replace)
{
$pattern = '/<span[^>]*(?:\/>|>(?:\s|&nbsp;)*<\/span>)/im';
return preg_replace($pattern, '', $html_replace);
}
This worked for me.