PHP regexp parse HTML


My regexp:
<([a-zA-Z0-9]+)>[\na-zA-Z0-9]*<\/\1+>
my string:
<div>
<f>
</f>
</div>
the result is:
array(2
0 => array(1
0 => <f>
</f>
)
1 => array(1
0 => f
)
)
Why is it capturing <f></f> and ignoring the outer <div>?

The answer is USE A PARSER INSTEAD (sorry for shouting). A regular expression can be a quick way to pull out an ID or a URL, but matching HTML tags with one is error-prone, because tags nest and carry attributes. Consider the following code. Isn't it much more readable than a string of druidic characters with special meanings?
<?php
$str = "
<container>
<div class='someclass' data='somedata'>
<f>some content here</f>
</div>
</container>";
$xml = simplexml_load_string($str);
echo $xml->div->f; // some content here
$attributes = $xml->div->attributes();
print_r($attributes); // class and data as keys
?>

I'd say it's because your character class [\na-zA-Z0-9]* only allows newlines, letters, and digits between the opening tag and its closing tag. It cannot consume the < and > of the nested <f> tags, so <div>...</div> can never match, while <f>, a newline, and </f> can.
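A minimal sketch of that explanation, using the input from the question. The second pattern is not from the question; it is an illustrative variant (an assumption) that swaps the character class for a lazy .*? with the s modifier so that nested tags can be consumed.

```php
<?php
// Input from the question: <f> nested inside <div>, one tag per line.
$html = "<div>\n<f>\n</f>\n</div>";

// Original pattern: only newlines, letters, and digits may sit between the tags.
preg_match_all('~<([a-zA-Z0-9]+)>[\na-zA-Z0-9]*</\1+>~', $html, $narrow);

// Illustrative variant (not the asker's pattern): any characters, as few as possible.
preg_match_all('~<([a-zA-Z0-9]+)>.*?</\1>~s', $html, $wide);

var_export($narrow[1]); // tag names matched by the original pattern
echo "\n";
var_export($wide[1]);   // tag names matched by the variant
```

Even the wider pattern breaks as soon as a tag carries attributes or the same tag name nests inside itself, which is why the parser approach is the safer choice.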

Related

Regex to find anchor tag not working accurately

I have the following regex to find anchor tag that has 'Kontakt' as the anchor text:
#<a.*href="[^"]*".*>Kontakt<\/a>#
Here is the string to find from:
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</li></ul>
So the result should be:
<a href="/kontakt" >Kontakt</a>
But the result I get is:
Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt
And here is my PHP code:
$pattern = '#<a.*href="[^"]*".*>Kontakt<\/a>#';
preg_match_all($pattern, $string, $matches);

You are using preg_match_all(), so I assume you want to receive every qualifying anchor tag. Parsing valid HTML with a legitimate DOM parser will always be more stable and easier to read than the equivalent regex technique. Relying on regex for DOM parsing is not a good idea because regex is "DOM-unaware": it just matches text that looks like HTML tags.
In the XPath query, search for <a> tags (existing at any depth in the document) whose entire text is the qualifying string.
Code:
$html = <<<HTML
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</li></ul>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//a[text() = "Kontakt"]') as $a) {
    $result[] = $dom->saveHTML($a);
}
var_export($result);
Output:
array (
  0 => '<a href="/kontakt">Kontakt</a>',
)
Is it more concise to use regex? Yes, but it is also less reliable for general use.
You will notice that the DOMDocument also automatically cleans up the unnecessary spacing in your markup.

If you can trust that every anchor tag in your input starts with <a href, then try:
'#<a href="[^"]*"[^>]*>Kontakt<\/a>#';
// Instead of what you have:
'#<a.*href="[^"]*".*>Kontakt<\/a>#';
.* combines the "wildcard" metacharacter . with the "zero or more times" quantifier *.
Being greedy, .* matches as many characters as it can (anything except a newline), so it happily runs across tag boundaries.
Try it: https://regex101.com/r/qxnRZv/1

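Your regex:
...a.*href...
is greedy, which means: "after a, match as many characters as possible before href". That lets a single match span several anchor tags.
You can use the lazy-mode operator ? :
...a.*?href....
which means "after a, match as few characters as possible before href". That shortens the first gap, but the second .* needs the same treatment, and a lazy match still starts at the leftmost <a in the string. Restricting both gaps to [^>]* is the more robust fix.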
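To compare the three variants side by side, here is a small sketch. The sample string is hypothetical (two properly closed anchors), and the "negated" pattern is one possible way to keep each gap inside a single tag, not the only one.

```php
<?php
// Hypothetical sample with properly closed anchors.
$html = '<li><a href="/team" >Team</a></li><li><a href="/kontakt" >Kontakt</a></li>';

$patterns = [
    'greedy'  => '#<a.*href="[^"]*".*>Kontakt</a>#',
    'lazy'    => '#<a.*?href="[^"]*".*?>Kontakt</a>#',
    'negated' => '#<a[^>]*href="[^"]*"[^>]*>Kontakt</a>#',
];

$found = [];
foreach ($patterns as $name => $pattern) {
    // Keep the first full match of each pattern, or null when there is none.
    $found[$name] = preg_match($pattern, $html, $m) ? $m[0] : null;
    echo $name, ': ', $found[$name], "\n";
}
```

On this input the greedy and the lazy pattern both start at the first <a and run through to Kontakt</a>; only the negated character class stays inside the /kontakt anchor.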

php only remove wrapper tag

How can I remove only the wrapper tag with preg_replace?
For example, I want to remove the outer p tag from this:
$html = "<p><div><p>aaaaaa</p></div></p>";
Output should be: <div><p>aaaaaa</p></div>
If input is
$html = "<p>aaaaaa</p><div>bbbb</div>";
Output should be: <p>aaaaaa</p><div>bbbb</div>
I tried using this regex: '/<p[^>]*>(.*)<\/p[^>]*>/i' but it replaced all p tags.

Here is a regex approach using a recursive pattern.
Code:
$htmls = [
"<p><div><p>aaaaaa</p></div></p>",
"<div><p>aaaaaa</p></div>",
"<p>aaaaaa</p><div>bbbbbb</div>",
"<p>aaaaaa</p><div>bbbbbb</div><p>cccccc</p>",
"<p>aaaaaa</p><p>bbbbbb</p>",
"<p>hello<p>aaaaaa</p></p>",
"<p><p>aaaaaa</p></p>"
];
foreach ($htmls as $i => $html) {
    $without_ptags = preg_replace('~<p>(?:(?R)|.*?)*</p>~', '', $html, 2, $count);
    if ($without_ptags === '' && $count == 1) {
        echo "$i => ", substr($html, 3, -4);
    } else {
        echo "$i => not wrapped in p tags";
    }
    echo "\n---\n";
}
Output:
0 => <div><p>aaaaaa</p></div>
---
1 => not wrapped in p tags
---
2 => not wrapped in p tags
---
3 => not wrapped in p tags
---
4 => not wrapped in p tags
---
5 => hello<p>aaaaaa</p>
---
6 => <p>aaaaaa</p>
---
*Note: Parsing HTML with regex is not recommended. If I can come up with a clever DOMDocument approach, I'll add it to my answer.
Until then, my code uses a recursive pattern to replace <p>...</p> substrings with an empty string. preg_replace() stores the number of replacements made in $count. If the output string is completely empty and $count is 1, the HTML string was fully nested in a single parent <p> tag. Once that is established, substr() removes the leading <p> and the trailing </p>. A replacement limit of 2 is used because two or more replacements disqualify the HTML string regardless of what ends up in $without_ptags.

Extract data from HTML tags [ Automatically get start and end tags ] by php [duplicate]

I want to write a script that automatically extracts the content of HTML tags (from start tag to end tag) and stores each block in an array.
Example:
Input:
$str = <p>This is a sample <b>text</b> </p> this is out of tags.<p>This is <p>another text</p>for same aggregate <i>tags</i>.</p>
output:
$blocks[0] = <p>This is a sample <b>text</b> </p>
$blocks[1] = <p>This is <p>another text</p>for same aggregate <i>tags</i>.</p>
NB: the first block starts with <p>, so it must stop at the first </p>. The second block also starts with <p>, but it contains another start and end paragraph tag (<p></p>) and stops at the matching </p>. In other words, I want everything between a start tag and its matching end tag, inner tags included.

I'll try to provide an answer to this, although this solution does not give you exactly what you are looking for, since nested <p> tags are not valid HTML. Using PHP's DOMDocument, you can extract the paragraph tags like this.
<?php
$test = "<p>This is a sample <b>text</b> </p> this is out of tags.<p>This is <p>another text</p>for same aggregate <i>tags</i>.</p>";
$html = new DOMDocument();
$html->loadHTML($test);
$p_tags = array();
foreach ($html->getElementsByTagName('p') as $p) {
    $p_tags[] = $html->saveHTML($p);
}
print_r($p_tags);
?>
After throwing some warnings at you because of the invalid tag nesting, the output should be the following:
Array
(
[0] => <p>This is a sample <b>text</b> </p>
[1] => <p>This is </p>
[2] => <p>another text</p>
)
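If the nested paragraph has to stay inside its parent block, as the question asks, a recursive regex pattern can keep balanced pairs together. This is only a sketch: it assumes the <p> tags carry no attributes and are properly balanced, and it is not a general HTML parser.

```php
<?php
$test = "<p>This is a sample <b>text</b> </p> this is out of tags.<p>This is <p>another text</p>for same aggregate <i>tags</i>.</p>";

// <p>, then any mix of nested <p>...</p> pairs (matched by the recursion (?R))
// and single characters that do not start a <p> or </p> tag, then the closing </p>.
$pattern = '~<p>(?:(?R)|(?!</?p>).)*</p>~s';

preg_match_all($pattern, $test, $matches);
$blocks = $matches[0]; // one entry per outermost <p>...</p> block
print_r($blocks);
```

Text outside any <p> block, such as "this is out of tags.", is simply skipped because it is not part of a match.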

You can use the Simple HTML DOM library to do this. Here is an example.
require_once('simple_html_dom.php');
$html = " <p>This is a sample <b>text</b> </p> this is out of tags.<p>This is <p>another text</p>for same aggregate <i>tags</i>.</p>";
$html = str_get_html($html);
$p = $html->find('p');
$contentArray = array();
foreach ($p as $element) {
    $contentArray[] = $element->innertext; // Use $element->outertext to get the output with the tag, i.e. <p>content</p>
}
print_r($contentArray);
The output looks like this:
Array
(
[0] => This is a sample <b>text</b>
[1] => This is
[2] => another text
)

PHP split content when a HTML element is found

I have a PHP variable that holds some HTML, and I want to split it into two pieces. The split should happen where the second bold tag (<strong> or <b>) is found. For example, if I have content that looks like this:
My content
This is my content. Some more bold content, that would spilt into another variable.
Is this at all possible?

Something like this would basically work:
preg_split('/(<strong>|<b>)/', $html1, 3, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Given your test string of:
$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
you'd end up with
Array (
[0] => <strong>
[1] => My content</strong>This is my content.
[2] => <b>
[3] => Some more bold</b>content
)
Now, if your sample string did NOT start with strong/b:
$html2 = 'like the first, but <strong>My content</strong>This is my content.<b>Some more bold</b>content, has some initial none-tag content';
you'd end up with
Array (
[0] => like the first, but
[1] => <strong>
[2] => My content</strong>This is my content.
[3] => <b>
[4] => Some more bold</b>content, has some initial none-tag content
)
A simple test of whether element #0 is a tag or text then tells you where the "second tag and onwards" part starts: the second tag is element #2 or element #3, and the text after it is element #3 or element #4.
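One way to get the two pieces without reassembling the array is to look up the offset of the second opening tag and cut the string there. This is a sketch rather than the method described above: split_at_second_bold() is a hypothetical helper, and it assumes the tags are written exactly as <strong> or <b>, with no attributes.

```php
<?php
// Hypothetical helper: returns [text before the second bold tag, text from that tag onwards].
function split_at_second_bold(string $html): array
{
    // Collect every opening <strong> or <b> tag together with its byte offset.
    $count = preg_match_all('/<(?:strong|b)>/', $html, $m, PREG_OFFSET_CAPTURE);
    if ($count < 2) {
        return [$html, '']; // fewer than two bold tags: nothing to split off
    }
    $offset = $m[0][1][1]; // byte offset of the second match
    return [substr($html, 0, $offset), substr($html, $offset)];
}

$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
[$first, $second] = split_at_second_bold($html1);
echo $first, "\n", $second, "\n";
```

Because the offset is taken from the match itself, it makes no difference whether the string starts with a tag or with plain text.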

It is possible with 'positive lookbehind' in regular expressions. E.g., (?<=a)b matches the b (and only the b) in cab, but does not match bed or debt.
In your case, (?<=(\<strong|\<b)).*(\<strong|\<b) should do the trick. Use this regex in a preg_split() call and make sure to set PREG_SPLIT_DELIM_CAPTURE if you want those tags <b> or <strong> to be included.

If you really need to split the string, the regular expression approach might work. Parsing HTML this way is fragile in many respects, though.
If you just want to know the second node that has either a strong or b tag, using a DOM is so much easier. Not only is the code very obvious, all the parsing bits are taken care of for you.
<?php
$testHtml = '<p><strong>My content</strong><br>
This is my content. <strong>Some more bold</strong> content, that would spilt into another variable.</p>
<p><b>This should not be found</b></p>';
$htmlDocument = new DOMDocument;
if ($htmlDocument->loadHTML($testHtml) === false) {
    // crash and burn
    die();
}
$xPath = new DOMXPath($htmlDocument);
$boldNodes = $xPath->query('//strong | //b');
$secondNodeIndex = 1;
if ($boldNodes->item($secondNodeIndex) !== null) {
    $secondNode = $boldNodes->item($secondNodeIndex);
    var_dump($secondNode->nodeValue);
} else {
    // crash and burn
}

Regex to replace reg trademark

I need some help with regex:
I have some HTML output and need to wrap every registered trademark symbol (®) in <sup></sup>.
I cannot insert the <sup> tag into title and alt attributes, and obviously I don't need to wrap symbols that are already superscripted.
The following regex matches text that is not part of a HTML tag:
(?<=^|>)[^><]+?(?=<|$)
An example of what I'm looking for:
$original = `<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>`
The filtered string should output:
<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
thanks a lot for your time!!!

Well, here is a simple way, if you accept the following limitation:
Any ® that has already been processed is immediately followed by </sup>.
echo preg_replace('#®(?!\s*</sup>|[^<]*>)#','<sup>®</sup>', $s);
The logic behind it:
we replace only those ® which are not followed by </sup>, and
which are not followed by a > symbol without an opening < symbol before it (that is, which are not inside a tag)
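Applied to the sample string from the question, the pattern can be checked like this. The sketch assumes that the source file and the input share the same encoding (for example UTF-8), since ® is matched byte for byte.

```php
<?php
// Sample string from the question.
$s = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// Wrap a ® only when it is neither followed by </sup> nor located inside a tag.
$result = preg_replace('#®(?!\s*</sup>|[^<]*>)#', '<sup>®</sup>', $s);

echo $result, "\n";
```

One consequence of the second condition: a ® in plain text that is followed by a literal > before the next < would also be skipped, so the input should escape such characters as &gt;.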

I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).
You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.

Regex alone is not enough for what you want. First you must write code to identify whether a piece of content is the value of an attribute or a text node of an element. Then you must loop through all the text content and apply some replace method. I am not sure what it is in PHP, but in JavaScript it would look something like:
content[i].replace(/\®/g, "<sup>®</sup>");

I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.
I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>®</sup>. This leaves tokens that are either tags, an already superscripted ®, or plain text. Then, in each plain-text token, ® can be replaced with <sup>®</sup>:
$regex = '/(<sup>®<\/sup>|<.*?>)/i';
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
[0] => <div>
[1] => asd® asdasd. asd
[2] => <sup>®</sup>
[3] => asd
[4] => <img alt="qwe®qwe" />
[5] => </div>
)
*/
foreach ($tokens as &$token)
{
    if ($token[0] == "<") continue; // Skip tokens that are tags or an existing <sup>®</sup>
    $token = str_replace('®', '<sup>®</sup>', $token);
}
unset($token); // break the reference left over from the foreach
$tokens = join("", $tokens); // reassemble the string
// $tokens => "<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>"
Note that this is a naive approach; if the markup isn't formatted as expected, it might not parse the way you'd like (again, regular expressions are not good for HTML parsing ;) )
