Difficulties with the function preg_match_all - php

I would like to get back the number which is between span HTML tags. The number may change!
<span class="topic-count">
::before
"
24
"
::after
</span>
I've tried the following code:
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
But it doesn't work.
Entire code:
$result=array();
$page = 201;
while ($page>=1) {
$source = file_get_contents ("http://www.jeuxvideo.com/forums/0-27047-0-1-0-".$page."-0-counter-strike-global-offensive.htm");
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
$result = array_merge($result, $nombre[$i][1]);
print("Page : ".$page ."\n");
$page-=25;
}
print_r ($nombre);

Can do with
preg_match_all(
'#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s',
$html,
$matches
);
which would capture any digits before the end of the span.
However, note that this regex will only work for exactly this piece of html. If there is a slight variation in the markup, for instance, another class or another attribute, the pattern will not work anymore. Writing reliable regexes for HTML is hard.
Hence the recommendation to use a DOM parser instead, e.g.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.jeuxvideo.com/forums/0-27047-0-1-0-1-0-counter-strike-global-offensive.htm');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//span[contains(#class, "topic-count")]') as $node) {
if (preg_match_all('#\d+#s', $node->nodeValue, $topics)) {
echo $topics[0][0], PHP_EOL;
}
}
DOM will parse the entire page into a tree of nodes, which you can then query conveniently via XPath. Note the expression
//span[contains(#class, "topic-count")]
which will give you all the span elements with a class attribute containing the string topic-count. Then if any of these nodes contain a digit, echo it.

Related

Replace enclosing Apostrophs with HTML tags but not inside <code> blocks

Goal: Modifying an HTML string that contains apostrophs for wrapping code inline (like Stackoverflow is doing it). But the same time having <code> blocks that can also contain apostrophs which should stay unchanged.
Example:
<p>This is my `inline code`, it can be replaced and tag-wrapped.</p>
<p><code>This text contains `apostrophs`, but should `not` be changed.</code></p>
This regex I am using for converting all wrapping apostrophs to <code> elements:
// replace apostroph with incorporating <code> tag
$content = preg_replace('/(.+?)\`(.+?)\`/', '$1<code class="inlinecode">$2</code>', $content);
Required:
Change the regex, so that it does not convert the apostroph if it is withing a <code> block.
Disclaimer: I tried for several hours to read the HTML string, use PHP's DOM parser, extract all nodes of type code, change their content, write them back, then found out that nodeValue is removing all HTML tags (especially the line breaks). Then tried several solutions found online, still not working... Now I am falling back to regex, even against the odds.
FYI, how I tried it the DOM way:
$code_blocks = $dom->getElementsByTagName('code');
foreach($code_blocks as $codenode) {
// nodeValue strips HTML tags, we need to hack
$nodevalue_html = $codenode->ownerDocument->saveXML($codenode);
// replace, i.e. custom-store each apostroph with '~~~APO~~~' so that they survive
$nodevalue_html = preg_replace('/`/', '~~~APO~~~', $nodevalue_html);
// $codenode->textValue = $nodevalue_html; // fail
// $codenode->nodeValue = $nodevalue_html; // fail
// ...
}
// html to string
$html_new = $dom->saveHTML();
$html_new = preg_replace('/~~~APO~~~/', '`', $html_new);
I wished I could use Markdown like Stackoverflow, but I still need to deal with HTML.
Using an XPath query to avoid text nodes that have a code element as ancestor:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::code)][contains(.,"`")]');
foreach ($textNodes as $textNode) {
$parts = (function($text) { yield from explode('`', $text); })($textNode->nodeValue);
$frag = $dom->createDocumentFragment();
do {
$frag->appendChild($dom->createTextNode($parts->current()));
$parts->next();
if ( $parts->valid() ) {
$codeElt = $dom->createElement('code');
$codeElt->appendChild($dom->createTextNode($parts->current()));
$frag->appendChild($codeElt);
$parts->next();
}
} while ($parts->valid());
$textNode->parentNode->replaceChild($frag, $textNode);
}
echo $dom->saveHTML();
demo
demo for php < 7.0
I believe the only way is to explode and reassemble the string:
$html_string = '....................'; // contains apostrophes and <code>...</code> blocks
$delim = "<code>";
$closing_tag = "</code>";
$explode = explode($delim, $html_string);
foreach($explode as &$ex) {
$closing_tag_pos = strpos($ex, $closing_tag);
if ($closing_tag_pos !== false) {
$pre_closing_tag = substr($ex, 0, $closing_tag_pos);
$post_closing_tag = substr($ex, $closing_tag_pos);
$ex = $pre_closing_tag . preg_replace('/`/', '~~~APO~~~', $post_closing_tag);
}
}
$mapped_html_string = implode($delim, $explode);

Validation like Strip_tags for HTML Attributes [duplicate]

I have this html code:
<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>
How can I remove attributes from all tags? I'd like it to look like this:
<p>
<strong>hello</strong>
</p>
Adapted from my answer on a similar question
$text = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p>';
echo preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si",'<$1$2>', $text);
// <p><strong>hello</strong></p>
The RegExp broken down:
/ # Start Pattern
< # Match '<' at beginning of tags
( # Start Capture Group $1 - Tag Name
[a-z] # Match 'a' through 'z'
[a-z0-9]* # Match 'a' through 'z' or '0' through '9' zero or more times
) # End Capture Group
[^>]*? # Match anything other than '>', Zero or More times, not-greedy (wont eat the /)
(\/?) # Capture Group $2 - '/' if it is there
> # Match '>'
/is # End Pattern - Case Insensitive & Multi-line ability
Add some quoting, and use the replacement text <$1$2> it should strip any text after the tagname until the end of tag /> or just >.
Please Note This isn't necessarily going to work on ALL input, as the Anti-HTML + RegExp will tell you. There are a few fallbacks, most notably <p style=">"> would end up <p>"> and a few other broken issues... I would recommend looking at Zend_Filter_StripTags as a more full proof tags/attributes filter in PHP
Here is how to do it with native DOM:
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html); // load HTML into it
$xpath = new DOMXPath($dom); // create a new XPath
$nodes = $xpath->query('//*[#style]'); // Find elements with a style attribute
foreach ($nodes as $node) { // Iterate over found elements
$node->removeAttribute('style'); // Remove style attribute
}
echo $dom->saveHTML(); // output cleaned HTML
If you want to remove all possible attributes from all possible tags, do
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//#*');
foreach ($nodes as $node) {
$node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML();
I would avoid using regex as HTML is not a regular language and instead use a html parser like Simple HTML DOM
You can get a list of attributes that the object has by using attr. For example:
$html = str_get_html('<div id="hello">World</div>');
var_dump($html->find("div", 0)->attr); /
/*
array(1) {
["id"]=>
string(5) "hello"
}
*/
foreach ( $html->find("div", 0)->attr as &$value ){
$value = null;
}
print $html
//<div>World</div>
$html_text = '<p>Hello <b onclick="alert(123)" style="color: red">world</b>. <i>Its beautiful day.</i></p>';
$strip_text = strip_tags($html_text, '<b>');
$result = preg_replace('/<(\w+)[^>]*>/', '<$1>', $strip_text);
echo $result;
// Result
string 'Hello <b>world</b>. Its beautiful day.'
Another way to do it using php's DOMDocument class (without xpath) is to iterate over the attributes on a given node. Please note, due to the way php handles the DOMNamedNodeMap class, you must iterate backward over the collection if you plan on altering it. This behaviour has been discussed elsewhere and is also noted in the documentation comments. The same applies to the DOMNodeList class when it comes to removing or adding elements. To be on the safe side, I always iterate backwards with these objects.
Here is a simple example:
function scrubAttributes($html) {
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
for ($els = $dom->getElementsByTagname('*'), $i = $els->length - 1; $i >= 0; $i--) {
for ($attrs = $els->item($i)->attributes, $ii = $attrs->length - 1; $ii >= 0; $ii--) {
$els->item($i)->removeAttribute($attrs->item($ii)->name);
}
}
return $dom->saveHTML();
}
Here's a demo: https://3v4l.org/M2ing
Optimized regular expression from the top rated answer on this issue:
$text = '<div width="5px">a is less than b: a<b, ya know?</div>';
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
// <div>a is less than b: a<b, ya know?</div>
UPDATE:
It works better when allow only some tags with PHP strip_tags() function. Let's say we want to allow only <br>, <b> and <i> tags, then:
$text = '<i style=">">Italic</i>';
$text = strip_tags($text, '<br><b><i>');
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
//<i>Italic</i>
As we can see it fixes flaws connected with tag symbols in attribute values.
Regex's are too fragile for HTML parsing. In your example, the following would strip out your attributes:
echo preg_replace(
"|<(\w+)([^>/]+)?|",
"<$1",
"<p style=\"padding:0px;\">\n<strong style=\"padding:0;margin:0;\">hello</strong>\n</p>\n"
);
Update
Make to second capture optional and do not strip '/' from closing tags:
|<(\w+)([^>]+)| to |<(\w+)([^>/]+)?|
Demonstrate this regular expression works:
$ phpsh
Starting php
type 'h' or 'help' to see instructions & features
php> $html = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello<br/></strong></p>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<p><strong>hello</strong><br/></p>
php> $html = '<strong>hello</strong>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<strong>hello</strong>
Hope this helps. It may not be the fastest way to do it, especially for large blocks of html.
If anyone has any suggestions as to make this faster, let me know.
function StringEx($str, $start, $end)
{
$str_low = strtolower($str);
$pos_start = strpos($str_low, $start);
$pos_end = strpos($str_low, $end, ($pos_start + strlen($start)));
if($pos_end==0) return false;
if ( ($pos_start !== false) && ($pos_end !== false) )
{
$pos1 = $pos_start + strlen($start);
$pos2 = $pos_end - $pos1;
$RData = substr($str, $pos1, $pos2);
if($RData=='') { return true; }
return $RData;
}
return false;
}
$S = '<'; $E = '>'; while($RData=StringEx($DATA, $S, $E)) { if($RData==true) {$RData='';} $DATA = str_ireplace($S.$RData.$E, '||||||', $DATA); } $DATA = str_ireplace('||||||', $S.$E, $DATA);
To do SPECIFICALLY what andufo wants, it's simply:
$html = preg_replace( "#(<[a-zA-Z0-9]+)[^\>]+>#", "\\1>", $html );
That is, he wants to strip anything but the tag name out of the opening tag. It won't work for self-closing tags of course.
Here's an easy way to get rid of attributes. It handles malformed html pretty well.
<?php
$string = '<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>';
//get all html elements on a line by themselves
$string_html_on_lines = str_replace (array("<",">"),array("\n<",">\n"),$string);
//find lines starting with a '<' and any letters or numbers upto the first space. throw everything after the space away.
$string_attribute_free = preg_replace("/\n(<[\w123456]+)\s.+/i","\n$1>",$string_html_on_lines);
echo $string_attribute_free;
?>

Reg expression to remove empty Tags (any of them)?

I like to remove any empty html tag which is empty or containing spaces.
something like to get:
$string = "<b>text</b><b><span> </span></b><p> <br/></p><b></b><font size='4'></font>";
to:
$string ="<b>text</b>=;
Here is an approach with DOM:
// init the document
$dom = new DOMDocument;
$dom->loadHTML($string);
// fetch all the wanted nodes
$xp = new DOMXPath($dom);
foreach($xp->query('//*[not(node()) or normalize-space() = ""]') as $node) {
$node->parentNode->removeChild($node);
}
// output the cleaned markup
echo $dom->saveXml(
$dom->getElementsByTagName('body')->item(0)
);
This would output something like
<body><b>text</b></body>
XML documents require a root element, so there is no way to omit that. You can str_replace it though. The above can handle broken HTML.
If you want to selectively remove specific nodes, adjust the XPath query.
Also see
How do you parse and process HTML/XML in PHP?
Locating the node by value containing whitespaces using XPath
function stripEmptyTags ($result)
{
$regexps = array (
'~<(\w+)\b[^\>]*>\s*</\\1>~',
'~<\w+\s*/>~'
);
do
{
$string = $result;
$result = preg_replace ($regexps, '', $string);
}
while ($result != $string);
return $result;
}
$string = "<b>text</b><b><span> </span></b><p> <br/></p><b></b><font size='4'></font>";
echo stripEmptyTags ($string);
You will need to run the code multiple times in order to do this only with regular expressions.
the regex that does this is:
/<(?:(\w+)(?: [^>]*)?`> *<\/$1>)|(?:<\w+(?: [^>]*)?\/>)/g
But for example on your string you have to run it at least twice. Once it will remove the <br/> and the second time will remove the remaining <p> </p>.

Regex / DOMDocument - match and replace text not in a link

I need to find and replace all text matches in a case insensitive way, unless the text is within an anchor tag - for example:
<p>Match this text and replace it</p>
<p>Don't match this text</p>
<p>We still need to match this text and replace it</p>
Searching for 'match this text' would only replace the first instance and last instance.
[Edit] As per Gordon's comment, it may be preferred to use DOMDocument in this instance. I'm not at all familiar with the DOMDocument extension, and would really appreciate some basic examples for this functionality.
Here is an UTF-8 safe solution, which not only works with properly formatted documents, but also with document fragments.
The mb_convert_encoding is needed, because loadHtml() seems to has a bug with UTF-8 encoding (see here and here).
The mb_substr is trimming the body tag from the output, this way you get back your original content without any additional markup.
<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is a link <span>with <strong>don\'t match this text</strong> content</span></p>';
$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
$replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
$newNode = $dom->createDocumentFragment();
$newNode->appendXML($replaced);
$node->parentNode->replaceChild($newNode, $node);
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?
I read dozens of answers in the subject, so I am sorry if I forgot somebody (please comment it and I will add yours as well in this case).
Thanks for Gordon and stillstanding for commenting on my other answer.
Try this one:
$dom = new DOMDocument;
$dom->loadHTML($html_content);
function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
if (!empty($dom->childNodes)) {
foreach ($dom->childNodes as $node) {
if ($node instanceof DOMText &&
!in_array($node->parentNode->nodeName, $excludeParents))
{
$node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
}
else
{
preg_replace_dom($regex, $replacement, $node, $excludeParents);
}
}
}
}
preg_replace_dom('/match this text/i', 'IT WORKS', $dom->documentElement, array('a'));
This is the stackless non-recursive approach using pre-order traversal of the DOM tree.
libxml_use_internal_errors(TRUE);
$dom=new DOMDocument('1.0','UTF-8');
$dom->substituteEntities=FALSE;
$dom->recover=TRUE;
$dom->strictErrorChecking=FALSE;
$dom->loadHTMLFile($file);
$root=$dom->documentElement;
$node=$root;
$flag=FALSE;
for (;;) {
if (!$flag) {
if ($node->nodeType==XML_TEXT_NODE &&
$node->parentNode->tagName!='a') {
$node->nodeValue=preg_replace(
'/match this text/is',
$replacement, $node->nodeValue
);
}
if ($node->firstChild) {
$node=$node->firstChild;
continue;
}
}
if ($node->isSameNode($root)) break;
if ($flag=$node->nextSibling)
$node=$node->nextSibling;
else
$node=$node->parentNode;
}
echo $dom->saveHTML();
libxml_use_internal_errors(TRUE); and the 3 lines of code after $dom=new DOMDocument; should be able to handle any malformed HTML.
$a='<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace it</p>';
echo preg_replace('~match this text(?![^<]*</a>)~i','replacement',$a);
The negative lookahead ensures the replacement happens only if the next tag is not a closing link . It works fine with your example, though it won't work if you happen to use other tags inside your links.
You can use PHP Simple HTML DOM Parser. It is similar to DOMDocument, but in my opinion it's simpler to use.
Here is the alternative in parallel with Netcoder's DomDocument solution:
function replaceWithSimpleHtmlDom($html_content, $search, $replace, $excludedParents = array()) {
require_once('simple_html_dom.php');
$html = str_get_html($html_content);
foreach ($html->find('text') as $element) {
if (!in_array($element->parent()->tag, $excludedParents))
$element->innertext = str_ireplace($search, $replace, $element->innertext);
}
return (string)$html;
}
I have just profiled this code against my DomDocument solution (witch prints the exact same output), and the DomDocument is (not surprisingly) way faster (~4ms against ~77ms).
<?php
$a = '<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace it</p>
';
$res = preg_replace("#[^<a.*>]match this text#",'replacement',$a);
echo $res;
?>
This way works. Hope you want realy case sensitive, so match with small letter.
HTML parsing with regexs is a huge challenge, and they can very easily end up getting too complex and taking up loads of memory. I would say the best way is to do this:
preg_replace('/match this text/i','replacement text');
preg_replace('/(<a[^>]*>[^(<\/a)]*)replacement text(.*?<\/a)/is',"$1match this text$3");
If your replacement text is something which might occur otherwise, you might want to add an intermediate step with some unique identifier.

Using regex to remove HTML tags

I need to convert
$text = 'We had <i>fun</i>. Look at this photo of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.
First do a preg_replace to keep the link. You could use:
preg_replace('(.*?)', '$\2 ($\1)', $str);
Then use strip_tags which will finish off the rest of the tags.
try an xml parser to replace any tag with it's inner html and the a tags with its href attribute.
http://www.php.net/manual/en/book.domxml.php
The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[#href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $node) {
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load any HTML into a DomDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node element from the innerHTML and the href attribute and replaces the link. The second version just uses the DOM API and no Xpath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.
I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = replace($text, "<i>", "");
$text = replace($text, "</i>", "");
(My php is really rusty, so replace may not be the right function name -- but the idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
$start = strrpos( $text, "<a" );
$end = strrpos( $text, "</a>", $start );
$text = substr( $text, $start, $end );
$text = replace($text, "</a>", "");
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php
It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at this photo of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the code you want in your test case.

Categories