Unable to use regex to search in PHP? - php

I'm trying to extract the content of specific tags from an HTML document.
My method works for some tags, but not all, and it does not work for the tag whose content I want to get.
Here is my code:
<html>
<head></head>
<body>
<?php
$url = "http://sf.backpage.com/MusicInstruction/";
$data = file_get_contents($url);
$pattern = "/<div class=\"cat\">(.*)<\/div>/";
preg_match_all($pattern, $data, $adsLinks, PREG_SET_ORDER);
var_dump($adsLinks);
foreach ($adsLinks as $i) {
echo "<div class='ads'>".$i[0]."</div>";
}
?>
</body>
</html>
The above code doesn't work, but it works when I change the $pattern into:
$pattern = "/<div class=\"date\">(.*)<\/div>/";
or
$pattern = "/<div class=\"sponsorBoxPlusImages\">(.*)<\/div>/";
I can't see any difference between these patterns. Please help me find the error.
Thanks.

Use PHP DOM to parse HTML instead of regex.
For example in your case (code updated to show HTML):
$doc = new DOMDocument();
// @ suppresses the warnings libxml raises on malformed real-world HTML
@$doc->loadHTML(file_get_contents("http://sf.backpage.com/MusicInstruction/"));
$nodes = $doc->getElementsByTagName('div');
for ($i = 0; $i < $nodes->length; $i++)
{
    $x = $nodes->item($i);
    if ($x->getAttribute('class') == 'cat') {
        echo htmlspecialchars($x->nodeValue) . "<hr/>"; // this is the element that you want
    }
}

The reason your regex fails is that you are expecting . to match newlines, and it won't unless you use the s modifier, so try
$pattern = "/<div class=\"cat\">(.*)<\/div>/s";
When you do this, you might find the pattern too greedy, as it will capture everything up to the last closing div element. To make it non-greedy, so that it matches only up to the very next closing div, add a ? after the *:
$pattern = "/<div class=\"cat\">(.*?)<\/div>/s";
This just serves to illustrate that for all but the simplest cases, parsing HTML with regexes is the road to madness. So try using DOM functions for parsing HTML.
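A minimal sketch of the two points above, assuming a made-up snippet with two multi-line divs instead of the real page (the sample markup and variable names are illustrative only):

```php
<?php
// Hypothetical sample: two "cat" divs whose content spans several lines.
$data = "<div class=\"cat\">\nfirst\n</div>\n<div class=\"cat\">\nsecond\n</div>";

// Without /s the dot stops at newlines, so the closing tag is never reached.
$withoutS = preg_match_all('/<div class="cat">(.*)<\/div>/', $data, $m1);

// With /s but a greedy (.*), a single match runs up to the last </div>.
$greedy = preg_match_all('/<div class="cat">(.*)<\/div>/s', $data, $m2);

// With /s and a lazy (.*?), each div is matched on its own.
$lazy = preg_match_all('/<div class="cat">(.*?)<\/div>/s', $data, $m3);

echo "$withoutS $greedy $lazy\n";
```

The three counts differ only because of the modifier and the quantifier; the input is the same each time.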

Related

Validation like Strip_tags for HTML Attributes [duplicate]

I have this html code:
<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>
How can I remove attributes from all tags? I'd like it to look like this:
<p>
<strong>hello</strong>
</p>
Adapted from my answer on a similar question
$text = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p>';
echo preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si",'<$1$2>', $text);
// <p><strong>hello</strong></p>
The RegExp broken down:
/ # Start Pattern
< # Match '<' at beginning of tags
( # Start Capture Group $1 - Tag Name
[a-z] # Match 'a' through 'z'
[a-z0-9]* # Match 'a' through 'z' or '0' through '9' zero or more times
) # End Capture Group
[^>]*? # Match anything other than '>', Zero or More times, not-greedy (wont eat the /)
(\/?) # Capture Group $2 - '/' if it is there
> # Match '>'
/is # End Pattern - Case Insensitive (i) & dot also matches newlines (s)
Add some quoting and use the replacement text <$1$2>. It should strip any text after the tag name up to the end of the tag, whether that is /> or just >.
Please note: this isn't necessarily going to work on ALL input, as the Anti-HTML + RegExp crowd will tell you. There are a few pitfalls, most notably <p style=">"> would end up as <p>">, along with a few other broken cases. I would recommend looking at Zend_Filter_StripTags as a more foolproof tag/attribute filter in PHP.
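The pitfall mentioned above can be reproduced with a short sketch. The two input strings are made up for illustration; the pattern is the one from this answer:

```php
<?php
// The attribute-stripping pattern from the answer above.
$pattern = '/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si';

// Ordinary input: the attribute is removed as intended.
$ok = preg_replace($pattern, '<$1$2>', '<p style="padding:0px;">hello</p>');

// A ">" inside an attribute value ends the match early and leaves debris behind.
$broken = preg_replace($pattern, '<$1$2>', '<p style=">">hello</p>');

echo $ok, "\n", $broken, "\n";
```

The second call shows why a pattern built on [^>] cannot tell a tag-closing > from one inside a quoted value.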
Here is how to do it with native DOM:
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html); // load HTML into it
$xpath = new DOMXPath($dom); // create a new XPath
$nodes = $xpath->query('//*[@style]'); // Find elements with a style attribute
foreach ($nodes as $node) { // Iterate over found elements
$node->removeAttribute('style'); // Remove style attribute
}
echo $dom->saveHTML(); // output cleaned HTML
If you want to remove all possible attributes from all possible tags, do
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {
$node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML();
I would avoid using regex as HTML is not a regular language and instead use a html parser like Simple HTML DOM
You can get a list of attributes that the object has by using attr. For example:
$html = str_get_html('<div id="hello">World</div>');
var_dump($html->find("div", 0)->attr);
/*
array(1) {
["id"]=>
string(5) "hello"
}
*/
foreach ( $html->find("div", 0)->attr as &$value ){
$value = null;
}
print $html;
//<div>World</div>
$html_text = '<p>Hello <b onclick="alert(123)" style="color: red">world</b>. <i>Its beautiful day.</i></p>';
$strip_text = strip_tags($html_text, '<b>');
$result = preg_replace('/<(\w+)[^>]*>/', '<$1>', $strip_text);
echo $result;
// Result
string 'Hello <b>world</b>. Its beautiful day.'
Another way to do it using php's DOMDocument class (without xpath) is to iterate over the attributes on a given node. Please note, due to the way php handles the DOMNamedNodeMap class, you must iterate backward over the collection if you plan on altering it. This behaviour has been discussed elsewhere and is also noted in the documentation comments. The same applies to the DOMNodeList class when it comes to removing or adding elements. To be on the safe side, I always iterate backwards with these objects.
Here is a simple example:
function scrubAttributes($html) {
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
for ($els = $dom->getElementsByTagname('*'), $i = $els->length - 1; $i >= 0; $i--) {
for ($attrs = $els->item($i)->attributes, $ii = $attrs->length - 1; $ii >= 0; $ii--) {
$els->item($i)->removeAttribute($attrs->item($ii)->name);
}
}
return $dom->saveHTML();
}
Here's a demo: https://3v4l.org/M2ing
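To see why the direction of iteration matters, here is a small sketch that compares both loops on a hand-built element. The helper makeElement and the three attribute names are hypothetical, and the forward loop is only expected to misbehave on the assumption that the attribute map is live, as described above:

```php
<?php
// Build a hypothetical element with three attributes, without parsing any HTML.
function makeElement(DOMDocument $dom): DOMElement {
    $el = $dom->createElement('p');
    foreach (['id' => 'a', 'class' => 'b', 'style' => 'c'] as $name => $value) {
        $el->setAttribute($name, $value);
    }
    return $el;
}

$dom = new DOMDocument();

// Forward: the map shrinks while the index grows, so some items are never visited.
$forward = makeElement($dom);
for ($i = 0; $i < $forward->attributes->length; $i++) {
    $forward->removeAttribute($forward->attributes->item($i)->name);
}

// Backward: removing the last item never shifts the ones still to be visited.
$backward = makeElement($dom);
for ($i = $backward->attributes->length - 1; $i >= 0; $i--) {
    $backward->removeAttribute($backward->attributes->item($i)->name);
}

echo $forward->attributes->length, ' ', $backward->attributes->length, "\n";
```

Comparing the two printed counts shows whether the forward loop left attributes behind on your PHP version.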
Optimized regular expression from the top rated answer on this issue:
$text = '<div width="5px">a is less than b: a<b, ya know?</div>';
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
// <div>a is less than b: a<b, ya know?</div>
UPDATE:
It works better when you allow only some tags with PHP's strip_tags() function. Let's say we want to allow only <br>, <b> and <i> tags, then:
$text = '<i style=">">Italic</i>';
$text = strip_tags($text, '<br><b><i>');
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
//<i>Italic</i>
As we can see it fixes flaws connected with tag symbols in attribute values.
Regexes are too fragile for HTML parsing. In your example, the following would strip out your attributes:
echo preg_replace(
"|<(\w+)([^>/]+)?|",
"<$1",
"<p style=\"padding:0px;\">\n<strong style=\"padding:0;margin:0;\">hello</strong>\n</p>\n"
);
Update
Make the second capture optional and do not strip '/' from self-closing tags:
|<(\w+)([^>]+)| to |<(\w+)([^>/]+)?|
Demonstrate this regular expression works:
$ phpsh
Starting php
type 'h' or 'help' to see instructions & features
php> $html = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello<br/></strong></p>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<p><strong>hello<br/></strong></p>
php> $html = '<strong>hello</strong>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<strong>hello</strong>
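The effect of the update can be sketched by running both versions of the pattern over the same input. The helper stripWith and the sample string are made up for this comparison:

```php
<?php
$before = '|<(\w+)([^>]+)|';    // second group is mandatory and may swallow "/"
$after  = '|<(\w+)([^>/]+)?|';  // second group is optional and stops at "/"

// Hypothetical helper: apply one of the patterns with the answer's replacement.
function stripWith(string $pattern, string $html): string {
    return preg_replace($pattern, '<$1', $html);
}

$input = '<strong>hi</strong><br/>';
$a = stripWith($before, $input);
$b = stripWith($after, $input);

echo $a, "\n", $b, "\n";
```

With the mandatory group, a tag without attributes still has to give up a character to the second group, which is what damages the tag name; making the group optional avoids that.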
Hope this helps. It may not be the fastest way to do it, especially for large blocks of html.
If anyone has any suggestions as to make this faster, let me know.
function StringEx($str, $start, $end)
{
$str_low = strtolower($str);
$pos_start = strpos($str_low, $start);
$pos_end = strpos($str_low, $end, ($pos_start + strlen($start)));
if($pos_end==0) return false;
if ( ($pos_start !== false) && ($pos_end !== false) )
{
$pos1 = $pos_start + strlen($start);
$pos2 = $pos_end - $pos1;
$RData = substr($str, $pos1, $pos2);
if($RData=='') { return true; }
return $RData;
}
return false;
}
$S = '<';
$E = '>';
while ($RData = StringEx($DATA, $S, $E)) {
    // StringEx returns true (not a string) when the delimiters enclose nothing
    if ($RData === true) { $RData = ''; }
    $DATA = str_ireplace($S.$RData.$E, '||||||', $DATA);
}
$DATA = str_ireplace('||||||', $S.$E, $DATA);
To do SPECIFICALLY what andufo wants, it's simply:
$html = preg_replace( "#(<[a-zA-Z0-9]+)[^\>]*>#", "\\1>", $html ); // * rather than +, so tags without attributes are left intact
That is, he wants to strip anything but the tag name out of the opening tag. It won't work for self-closing tags of course.
Here's an easy way to get rid of attributes. It handles malformed html pretty well.
<?php
$string = '<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>';
//get all html elements on a line by themselves
$string_html_on_lines = str_replace (array("<",">"),array("\n<",">\n"),$string);
//find lines starting with a '<' and any letters or numbers upto the first space. throw everything after the space away.
$string_attribute_free = preg_replace("/\n(<[\w123456]+)\s.+/i","\n$1>",$string_html_on_lines);
echo $string_attribute_free;
?>

Remove all text within specific tags

I am interested in removing all the text within the following tags:
<p class="wp-caption-text">Remove this text</p>
Can anybody give me an idea of how this can be done in php?
Thank you very much
Get rid of the tag and content inside of it:
$content = preg_replace('/<p\sclass=\"wp\-caption\-text\">[^<]+<\/p>/i', '', $content);
or if you want to preserve the tags:
$content = preg_replace('/(<p\sclass=\"wp\-caption\-text\">)[^<]+(<\/p>)/i', '$1$2', $content);
As a slightly higher-level alternative to regular expressions, you can process the HTML with DOM and match all the nodes you're looking for with the XPath expression //p[@class="wp-caption-text"].
For example:
$doc = new DOMDocument();
$doc->loadHTML($yourHTMLasString);
$xpath = new DOMXPath($doc);
$query = '//p[@class="wp-caption-text"]';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$entry->textContent = '';
}
echo $doc->saveHTML();
Try this:
$string = '<p class="wp-caption-text">Remove this text</p>';
$pattern = '/(.*<p .*>).*(<\/p>.*)/';
$replacement = '$1$2';
echo preg_replace($pattern, $replacement, $string);
If it's always the same tag, you could simply search for the string, then use the resulting position to take the substring from there to the closing tag.
Or you could use a regular expression; there are good ones posted here that can help you.
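The string-search idea could look roughly like this. It is only a sketch: the function name removeCaptionText is made up, it handles just the first occurrence, and it assumes the opening tag always appears exactly as written in $open:

```php
<?php
// Empty the first <p class="wp-caption-text"> element using plain string functions.
function removeCaptionText(string $html): string {
    $open  = '<p class="wp-caption-text">';
    $close = '</p>';

    $start = strpos($html, $open);
    if ($start === false) {
        return $html;               // tag not present, nothing to do
    }
    $contentStart = $start + strlen($open);
    $end = strpos($html, $close, $contentStart);
    if ($end === false) {
        return $html;               // no closing tag, leave the input untouched
    }
    // Keep everything up to the end of the opening tag and from the closing tag on.
    return substr($html, 0, $contentStart) . substr($html, $end);
}

echo removeCaptionText('<p class="wp-caption-text">Remove this text</p>'), "\n";
```

Any variation in the opening tag (extra attributes, different quoting) defeats this approach, which is where the DOM version above is more robust.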

Using regex to remove HTML tags

I need to convert
$text = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.
First do a preg_replace to keep the link. You could use:
$str = preg_replace('/<a\s[^>]*?href="(.*?)"[^>]*>(.*?)<\/a>/is', '$2 ($1)', $str);
Then use strip_tags which will finish off the rest of the tags.
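Put together, the two steps might look like the sketch below. The function name linksToText and the sample sentence are assumptions for illustration, and the pattern only covers double-quoted href values without nested links:

```php
<?php
// Rewrite each <a href="...">text</a> as "text (url)", then drop all remaining tags.
function linksToText(string $html): string {
    $withLinks = preg_replace(
        '/<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)<\/a>/is',
        '$2 ($1)',
        $html
    );
    return strip_tags($withLinks);
}

echo linksToText('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe'), "\n";
```

Because preg_replace replaces every match, several links in the same text are handled in one pass.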
Try an XML parser to replace each tag with its inner HTML, and each <a> tag with its text plus its href attribute.
http://www.php.net/manual/en/book.domxml.php
The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[@href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
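// copy the live node list to an array first, so replacing nodes does not skip any links
foreach(iterator_to_array($dom->getElementsByTagName('a')) as $node) {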
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load any HTML into a DomDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node element from the innerHTML and the href attribute and replaces the link. The second version just uses the DOM API and no Xpath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.
I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = str_replace("<i>", "", $text);
$text = str_replace("</i>", "", $text);
(My PHP is really rusty; str_replace takes the search string first, then the replacement, then the subject. The idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
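$start = strrpos( $text, "<a" );              // start of the last <a ...> tag
$open_end = strpos( $text, ">", $start );     // end of that opening tag
$end = strpos( $text, "</a>", $open_end );    // start of its closing tag
// keep the text before the tag, the link text, and the text after </a>
$text = substr( $text, 0, $start )
      . substr( $text, $open_end + 1, $end - $open_end - 1 )
      . substr( $text, $end + strlen("</a>") );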
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php
It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the code you want in your test case.

Matching everything between html <body> tags using PHP

I have a script that returns the following in a variable called $content
<body>
<p><span class=\"c-sc\">dgdfgdf</span></p>
</body>
However, I need to place everything between the body tags inside an array called matches.
I do the following to match the content between the body tags:
preg_match('/<body>(.*)<\/body>/',$content,$matches);
but the $matches array is empty. How can I get it to return everything inside the body tag?
Don't try to process html with regular expressions! Use PHP's builtin parser instead:
$dom = new DOMDocument;
$dom->loadHTML($string);
$bodies = $dom->getElementsByTagName('body');
assert($bodies->length === 1);
$body = $bodies->item(0);
$string = '';
foreach ($body->childNodes as $child) {
    $string .= $dom->saveHTML($child); // append each child of <body> as HTML
}
You should not use regular expressions to parse HTML.
Your particular problem in this case is you need to add the DOTALL modifier so that the dot matches newlines.
preg_match('/<body>(.*)<\/body>/s', $content, $matches);
But seriously, use an HTML parser instead. There are so many ways that the above regular expression can break.
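One of the ways it can break is sketched below: as soon as the body tag carries an attribute, the literal <body> in the pattern no longer matches, while a parser is unaffected. The sample markup (with a made-up class="page" attribute) is an assumption for illustration:

```php
<?php
// Hypothetical input: the same content, but <body> now carries an attribute.
$content = "<html><body class=\"page\">\n<p><span class=\"c-sc\">dgdfgdf</span></p>\n</body></html>";

// The literal "<body>" in the pattern no longer matches anything.
$found = preg_match('/<body>(.*)<\/body>/s', $content, $matches);

// DOMDocument finds the element regardless of its attributes.
$dom = new DOMDocument();
$dom->loadHTML($content);
$body = $dom->getElementsByTagName('body')->item(0);
$inner = '';
foreach ($body->childNodes as $child) {
    $inner .= $dom->saveHTML($child);   // serialize each child node of <body>
}

echo $found, "\n", trim($inner), "\n";
```

Loosening the pattern to allow attributes fixes this one case, but comments, uppercase tags, or a second body-like string in a script would need further special handling.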
If for some reason you don't have DOMDocument installed, try this
Step 1. Download simple_html_dom
Step 2. Read the documentation about how to use its selectors
require_once("simple_html_dom.php");
$doc = new simple_html_dom();
$doc->load($someHtmlString);
$body = $doc->find("body", 0)->innertext; // index 0 returns the first match instead of an array
