Regex to get HTML elements and attributes


I'm very new to regular expressions. I want to preg_match all elements in an HTML DOM that have a data-editable attribute. All other attributes of those elements should be matched as well, so I can reuse them later:
<div class="teaser" id="teaser" data-editable><p>Content</p></div>
After matching, I want the elements with the data-editable attribute to get specific CSS classes and to have another element added inside. So only block-level parents should be matched.
<div class="teaser editable" id="teaser"><button>edit</button><p>Content</p></div>
Here's what I've got:
<(div|p).*(data-editable).[^>]+>(.*?)<\/\1>
I know I'm totally wrong with that: this one also matches elements that do not have the data-editable attribute set, because of the .* inside. But how do I match the different attributes without losing their values?

You shouldn't parse HTML with regex. Instead, use an HTML parsing library such as the PHP Simple HTML DOM Parser to process your pages.
According to their documentation, you can do what you want through this: $html->find("div[data-editable]", 0)->outertext

Since HTML isn't a regular language, you're better off using a DOM parser. Much easier, too.
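For what it's worth, here is a minimal sketch of the DOM approach using PHP's built-in DOMDocument and DOMXPath instead of a third-party parser. The restriction to div and p mirrors the (div|p) in the question's regex, and removing the data-editable attribute afterwards is an assumption based on the desired output shown there.

```php
<?php
// Sample markup taken from the question.
$html = '<div class="teaser" id="teaser" data-editable><p>Content</p></div>';

$doc = new DOMDocument();
// These flags ask libxml not to wrap the fragment in <html><body>
// and not to add a doctype.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($doc);
// Select only <div> and <p> elements that carry a data-editable attribute.
foreach ($xpath->query('//*[self::div or self::p][@data-editable]') as $el) {
    // Keep the existing classes and append "editable".
    $el->setAttribute('class', trim($el->getAttribute('class') . ' editable'));
    $el->removeAttribute('data-editable');

    // Insert <button>edit</button> as the first child.
    $button = $doc->createElement('button', 'edit');
    $el->insertBefore($button, $el->firstChild);
}

$result = trim($doc->saveHTML());
echo $result, PHP_EOL;
```

All other attributes (id, class, and so on) stay on the element untouched, so nothing has to be captured and reassembled by hand. Whitespace in the serialized output can differ between libxml versions.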

Related

dynamically populating array with html content without stripping tags

I have some HTML Content.
Eg.
<p>Ask a question</p><p>Wait for an answer</p><p>Vote up an Answer</p>
I want to use PHP to store each paragraph, div, or any other HTML element as a separate element of an array:
$arr[0]="<p>Ask a question</p>";
$arr[1]="<p>Wait for an answer</p>";
I want to do the above task dynamically.

Lots of ways to do this. My first approach would be to use preg_match_all():
Assume your html is in $html:
preg_match_all( '|<p>.+?</p>|si', $html, $matches );
$matches will then be an array-of-arrays of 1 element x N elements, where N is the number of matches.
$matches[0][0] == '<p>Ask a question</p>';
$matches[0][1] == '<p>Wait for an answer</p>';
...
Edit: this can be generalized to match other tags, but a DOM parser should be used instead once the requirements exceed what regular expressions are capable of parsing.
If you want to match a fixed set of non-nestable tags, this approach will work. If the desired tags are nestable or self-closing, regular expressions are not the way to go, and a simple DOM parser is the right solution.
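A rough sketch of the DOM-parser alternative, assuming PHP's built-in DOMDocument and that only the top-level elements of the fragment are wanted. It relies on libxml wrapping the fragment in html/body, which it does by default for markup like this.

```php
<?php
// Sample markup taken from the question.
$html = '<p>Ask a question</p><p>Wait for an answer</p><p>Vote up an Answer</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// libxml adds <html><body> around the fragment, so the original
// top-level elements end up as direct children of <body>.
$body = $doc->getElementsByTagName('body')->item(0);

$arr = [];
foreach ($body->childNodes as $node) {
    // Skip text nodes and comments; keep elements with their tags intact.
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $arr[] = trim($doc->saveHTML($node));
    }
}

print_r($arr);
```

Unlike the regex, this keeps working when the elements are nested or are not all p tags.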

How to get text between div tags that contain class, style etc attributes before id attribute. I need to use regular expression

Hi, I'm using this regular expression to get the text inside the div with id "test":
<div id = "test">text</div>
$regex = "#\<div id=\"test\"\>(.+?)\<\/div\>#s";
But if the scenario changes, e.g.
<div class="testing" style="color:red" .... more attributes and id="test">text</div>
or
<div class="testing" ...some attributes... id="test".... some attributes....>text</div>
or
<div id="test" .........any number of attributes>text</div>
then the above regex will not be able to extract the text between the div tags. In the first case, more attributes are placed in front of the id attribute (id is the last attribute) and the regex doesn't work. In the second case the id attribute sits between other attributes, and in the third case it is the first attribute of the div tag.
Can I have a regex that matches all three cases, so that I can extract the text between the div tags by specifying the ID ONLY? I have to use regex only :( .
Please Help
Thank you....

I would strongly recommend an HTML parser to save yourself from the never-ending grief of trying to write a regular expression to parse HTML/XML.

I suggest you obtain that DOM element via xpath, the xpath expression for that element is:
//div[@class="testing"]
All this can be done with the PHP DOMDocument extension or alternatively with the SimpleXML extension. Both ship with PHP in 99.9% of installations, just like the regular expression extension. Some rough example code:
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about malformed markup
echo simplexml_import_dom($doc)->xpath('//div[@class="testing"]')[0];
XPath is a specialized language for querying elements and data from XML documents, whereas regular expressions are a language for simpler strings.
Edit: Same for ID: http://codepad.viper-7.com/h1FlO0
//div[@id="test"]
I guess you understand quite quickly how these simple xpath expressions work.
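To illustrate that attribute order stops mattering with XPath, here is a small sketch using DOMDocument and DOMXPath. The helper name divTextById is made up for this example, and the extra attributes in the sample strings are filled in only so the three scenarios from the question are runnable.

```php
<?php
// Hypothetical helper: text of the first <div> with the given id, or null.
function divTextById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // The id is pasted into the expression, so this sketch assumes
    // it contains no double quotes.
    $nodes = $xpath->query('//div[@id="' . $id . '"]');

    return $nodes->length > 0 ? $nodes->item(0)->textContent : null;
}

$variants = [
    '<div id="test">text</div>',                                   // id only
    '<div class="testing" style="color:red" id="test">text</div>', // id last
    '<div class="testing" id="test" title="x">text</div>',         // id in the middle
    '<div id="test" class="testing" style="color:red">text</div>', // id first
];

$results = array_map(fn($h) => divTextById($h, 'test'), $variants);
print_r($results);
```

Every variant yields the same text, because the query looks at the parsed attribute rather than at its position in the tag.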

Here's the answer with DOM (kind of crudish but works)
$aPieceOfHTML = '<div class="testing" id="test" style="color:red">This is my text blabla</div>';
$doc = new DOMDocument();
$doc->loadHTML($aPieceOfHTML);
$div = $doc->getElementsByTagName("div");
$mytext = $div->item(0)->nodeValue;
echo $mytext;
Here's the Cthulhu way:
$regex = '/(?<=id\=\"test\"\>).*(?=\<\/div\>)/';
DISCLAIMER
By no means do I guarantee this will work in every case (far from it). In fact, this will fail if:
id="test" is not the last tag attribute
if there is a space (or anything) between id="test" and the closing >.
If the div tag is not properly closed </div>
If the tags are written in uppercase
If tag attributes are written in uppercase
I don't know... this will probably fail in more cases
I could try to write a more complex regex but I don't think I could come up with something much better than this. Besides, it kind of seems a waste of time when you have other tools built in PHP that can parse HTML so much better.

I don't know if you still need this, but the regex below works for all of the given scenarios in your question.
(!?(<.*?>)|[^<]+)\s*
https://regex101.com/r/DAObw0/1
The matching group can be accessed with:
const [_, group1, group2] = myRegex.exec(input)

simple html dom: how get a tag without certain attribute

I want to get the tags whose "class" attribute equals "someclass", but only those that do not define an "id" attribute.
I tried the following (based on this answer), but it didn't work:
$html->find('.someclass[id!=*]');
Note:
I'm using Simple HTML DOM class and in the basic documentation that they give, I didn't find what I need.

From the PHP Simple HTML DOM Parser Manual, under the How to find HTML elements?, we can read:
[!attribute] Matches elements that don't have the specified attribute.
Your code would become:
$html->find('.someclass[!id]');
This will match elements with the class someclass that do not have an id attribute.
My original answer was based on the selection of elements just like we would with jQuery since the Simple HTML DOM Parser claims to support them on their main page where we can read:
Find tags on an HTML page with selectors just like jQuery.
My sincere apologies to those who were offended by my original answer and expressed their displeasure in the comments!

The Simple HTML DOM class does not support CSS3 pseudo-classes, which are required for negative attribute matching.
It is simple to work around the limitation without much trouble.
$nodes = array_filter($html->find('.something'), function($node){return empty($node->id);});
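If Simple HTML DOM isn't a hard requirement, the same selection can be sketched with PHP's built-in DOMXPath. This is an alternative technique, not part of the Simple HTML DOM API, and the sample markup is invented for the example.

```php
<?php
// Invented sample markup: only the second <li> should be selected.
$html = '<ul>'
      . '<li class="someclass" id="a">has id</li>'
      . '<li class="someclass other">no id</li>'
      . '<li class="someclassy">different class</li>'
      . '</ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The concat/normalize-space idiom matches "someclass" as a whole
// class token; not(@id) drops elements that define an id attribute.
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " someclass ")]'
       . '[not(@id)]';

$texts = [];
foreach ($xpath->query($query) as $node) {
    $texts[] = $node->textContent;
}

print_r($texts);
```

The class test is written out this way because XPath 1.0 has no direct equivalent of the CSS .someclass selector.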

Regex: Match html tag only if it contains a specific class id

Match an html tag using perl regex in php.
Want the tag to match if it contains "class=details" somewhere in the open tag.
Wanting to match <table border="0" class="details"> not <table border="0">
Wrote this to match it:
'#<table(.+?)class="details"(.+?)>#is'
The <table(.+?) creates a problem since it matches the first table tag it finds only stopping the match when it finds class="details" no matter how far down the code it occurs.
I think this logic would fix my problem:
"Match <table but only if it contains class="details" before the next >"
How can I write this?

While regular expressions can be good for a large variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of your document is so variable that it is hard to accurately (and by accurately I mean a 100% success rate with no false positives) extract a tag.
What I recommend you do is use a DOM parser such as phpQuery and use it as such:
function get_first_image($html) {
    $dom = phpQuery::newDocument($html);
    $first_img = $dom->find('img:first');
    if ($first_img !== null) {
        return $first_img->attr('src');
    }
    return null;
}
Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can also get the alt attribute.
A regular expression could be devised to achieve the same goal but would be limited in such way that it would force the alt attribute to be after the src or the opposite, and to overcome this limitation would add more complexity to the regular expression.
Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:
<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>
And then again, the above can fail if:
The attribute or tag name is in capital and the i modifier is not used.
Quotes are not used around the src attribute.
An attribute other than src uses the > character somewhere in its value.
Some other reason I have not foreseen.
So again, simply don't use regular expressions to parse a dom document.
Simple example on how to solve your problem with phpQuery:
$dom = phpQuery::newDocument($html);
$matching_tags = $dom->find('.details');

You will probably need a Positive Look Ahead of some form, as a very crude one that clearly has its limitations...
<table(?=[^>]*class="details")[^>]*>
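A quick sketch of how that lookahead behaves in PHP. The sample markup is invented, and the i modifier is added here so uppercase tags match too.

```php
<?php
// Invented sample: one plain table and two tables with class="details".
$html = '<table border="0"><tr><td>a</td></tr></table>'
      . '<table border="0" class="details"><tr><td>b</td></tr></table>'
      . '<TABLE class="details" border="1"></TABLE>';

// (?=[^>]*class="details") only succeeds if class="details" appears
// before the next ">", so the match cannot run into a later tag.
$pattern = '#<table(?=[^>]*class="details")[^>]*>#i';

preg_match_all($pattern, $html, $matches);
print_r($matches[0]);
```

The first table is skipped because [^>]* cannot cross its closing >. The pattern still assumes double quotes, that exact attribute value, and no > inside attribute values.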

HTML is not reliably parseable using regular expressions. There are a few simple cases which have a solution, but they are exceptions. I think that your case is unsolvable using regex, but I am not sure.
You should work with it using XML tools such as XML parsers and XPath for searching and testing your conditions. It is very simple to write an expression which matches your case. I don't know how to build an XML tree and execute an XPath query in PHP, but the XPath expression is
//table[@class='details']
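For the PHP side, a minimal sketch with the built-in DOMDocument and DOMXPath classes could look like this. The sample markup is made up. Note that @class='details' is an exact comparison, so a table with class="details wide" would not match.

```php
<?php
// Made-up sample: only the second table has class="details".
$html = '<table border="0"><tr><td>a</td></tr></table>'
      . '<table border="0" class="details"><tr><td>b</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$tables = $xpath->query("//table[@class='details']");

$count  = $tables->length;                            // number of matching tables
$border = $tables->item(0)->getAttribute('border');   // other attributes stay accessible
$text   = $tables->item(0)->textContent;

echo "$count match(es), border=$border, text=$text", PHP_EOL;
```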

You could possibly use a Regex like the following:
<\/?table[^>]*(class="details")*>
But the above users are correct in saying that it would be much better to use a xml/html type parser to find your item.

PHP DOM Get Tag Before N-th Table

Let's say the HTML contains 15 table tags, before each table there is a div tag with some text inside. I need to get the text from the div tag that is directly before the 10th table tag in the HTML markup. How would I do that?
The only way I can think of is to use explode('<table', $html) to split the HTML into parts and then get the last div tag from the 9th value of the exploded array with regular expression. Is there a better way?
I'm reading through the PHP DOM documentation but I cannot see any method that would help me with this task there.

You load your HTML into a DOMDocument and query it with this XPath expression:
//table[10]/preceding-sibling::div[1]
This would work for the following layout:
<div>Some text.</div>
<table><!-- #1 --></table>
<!-- ...eight more div/table pairs... -->
<div>Some other text.</div> <!-- this would be selected -->
<table><!-- #10 --></table>
<!-- ...five more... -->
XPath is capable of doing really complex node lookups with ease. If the above expression does not yet work for you, probably very little is required to make it do what you want.
HTML is structured data represented as a string, which is substantially different from being a plain string. Don't give in to the temptation of doing stuff like this with string handling functions like explode(), or even regex.
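Put together in PHP, that query could be run roughly like this. The 15 div/table pairs are generated only to make the sketch self-contained, and the div texts are invented.

```php
<?php
// Generate 15 <div>/<table> pairs as stand-in markup.
$html = '<body>';
for ($i = 1; $i <= 15; $i++) {
    $html .= "<div>Text before table $i</div>"
           . "<table><tr><td>$i</td></tr></table>";
}
$html .= '</body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// 10th <table> among its siblings, then the nearest preceding <div> sibling.
$nodes = $xpath->query('//table[10]/preceding-sibling::div[1]');
$text  = $nodes->length > 0 ? $nodes->item(0)->textContent : null;

echo $text, PHP_EOL;
```

Keep in mind that //table[10] means "the tenth table child of some parent", so for tables spread over different parents the expression would need adjusting, for example to (//table)[10].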

If you don't feel like learning xpath, you can use the same old-school DOM walking techniques you would use with JavaScript in the browser.
document.getElementsByTagName('table')[9]
then crawl your way up the .previousSibling values until you find one that isn't a TextNode and is a div
I've found that PHP's DOMDocument works pretty well with non-perfect HTML and then once you have the DOM I think you can even pass that into a SimpleXML object and work with it XML-style even though the original HTML/XHTML structure wasn't perfect.
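The same DOM walking can be sketched in PHP with DOMDocument. The helper name divBeforeNthTable is hypothetical and the markup is generated just for the example.

```php
<?php
// Hypothetical helper: nearest preceding sibling <div> of the n-th table
// (1-based, counted in document order), or null if there is none.
function divBeforeNthTable(DOMDocument $doc, int $n): ?DOMElement
{
    $table = $doc->getElementsByTagName('table')->item($n - 1);
    if ($table === null) {
        return null;
    }

    // Walk backwards over the siblings, skipping text nodes and comments.
    for ($node = $table->previousSibling; $node !== null; $node = $node->previousSibling) {
        if ($node instanceof DOMElement && strtolower($node->tagName) === 'div') {
            return $node;
        }
    }
    return null;
}

// Stand-in markup: 15 div/table pairs with whitespace in between.
$html = '';
for ($i = 1; $i <= 15; $i++) {
    $html .= "<div>Text before table $i</div>\n"
           . "<table><tr><td>$i</td></tr></table>\n";
}

$doc = new DOMDocument();
$doc->loadHTML($html);

$div  = divBeforeNthTable($doc, 10);
$text = $div !== null ? $div->textContent : null;

echo $text, PHP_EOL;
```

Unlike the XPath version, getElementsByTagName counts nested tables as well, so the two approaches can disagree on documents with tables inside tables.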
