How can I get rid of these in a scraped page, using a PHP regex to go over the content and replace them with nothing?
<div id="news-id-245245" style="display:inline;">
No other div on the page has an ID, so a pattern that simply removes all divs with an id attribute would also work.
The id of the div is always as follows: "news-id-NUMBER"
Regarding your comment "I am not sure if this even makes sense": Since it works, it makes sense. I simplified your solution a bit; we don't need the parentheses around .*?. (We do need the ? unless there is only a single match in $content.)
$content = preg_replace('/<div id="news-id-.*?" style="display:inline;">/s', '', $content);
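Since the id always has the form news-id-NUMBER, a slightly stricter variant could match digits only. This is a minimal sketch rather than the original poster's code, and the sample $content string is made up for illustration. Note that it removes only the opening tag, so the matching </div> stays in the page.

```php
<?php
// Hypothetical sample input; only the opening div tag is targeted.
$content = '<p>intro</p><div id="news-id-245245" style="display:inline;">story</div>';

// \d+ matches one or more digits, so neither a lazy quantifier nor /s is needed.
$content = preg_replace('/<div id="news-id-\d+" style="display:inline;">/', '', $content);

echo $content, "\n"; // <p>intro</p>story</div>
```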
I don't think regex is an option here; you'll have to target the element specifically. For example:
<?php
echo "<style>#news-id-NUMBER{display:none;}</style>";
?>
It seems I have a very special problem, as I couldn't find any related solution on the web so far. I have Smarty templates and I can remove pretty much all unnecessary whitespace with the trimwhitespace filter after making some slight modifications. However, I cannot get rid of leading whitespace within tags. Please have a look at the following two examples:
<h1>A headline without any leading whitespace</h1>
<h1>
A headline like it would be formatted by an IDE
</h1>
My problem is that the Smarty trimwhitespace output filter does not trim the second example. When I place an icon before the headline using CSS :before, there is whitespace between the icon and the second example but not when applied to the first example.
Is it possible to use preg_replace to trim the second example so that it looks the same in HTML as the first example does?
(?<=<h1>)\s*|\s*(?=<\/h1>)
Try this. See demo.
http://regex101.com/r/hQ1rP0/61
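Applied from PHP, the pattern above could be used roughly like this. It is a sketch only: the $html sample is taken from the question, and the variable names are assumptions.

```php
<?php
// Sample input: a headline formatted the way an IDE would indent it.
$html = "<h1>\n    A headline like it would be formatted by an IDE\n</h1>";

// The lookbehind strips whitespace right after <h1>,
// the lookahead strips whitespace right before </h1>.
$trimmed = preg_replace('/(?<=<h1>)\s*|\s*(?=<\/h1>)/', '', $html);

echo $trimmed, "\n"; // <h1>A headline like it would be formatted by an IDE</h1>
```

Whitespace between words inside the headline is left alone, because neither lookaround matches there.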
This function should help.
function anyname($string)
{
    // Remove runs of two or more whitespace characters, then any remaining newlines.
    return preg_replace(array('/\s{2,}/', '/\n/'), '', $string);
}
I think this won't take many combinations. I need to remove from a string a paragraph that has a specific class (it is consistent).
Example:
<p class="special_class">Some content</p>
I want to remove the content of every paragraph that has special_class only, so I would like to run a regex that replaces it with an empty string. I do not want to use a parser for this, please; I am using this in a very small function inside my script.
Thank you :)
It is generally a bad idea to use regex for HTML parsing. Take a look at Simple HTML DOM Parser instead to find the specified element and remove it.
If you want to use regex anyway, you could try this instead:
$result = preg_replace('/<p class="special_class">[\w\s]*<\/p>/', '<p class="special_class"></p>', $subject);
$result = preg_replace('%<p\s+class="special_class">.*?</p>%s', '', $subject);
should work. It will remove the entire paragraph from the opening to the closing p tag. It expects that the p tag is properly closed. But you already seem to know about the drawbacks of handling HTML with regexes...
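To see the lazy .*? and the s modifier at work, here is a small sketch with a made-up $subject that contains a multi-line paragraph. The sample markup is an assumption, not taken from the question.

```php
<?php
// Hypothetical input: the special paragraph spans two lines.
$subject = "<p>keep</p>\n<p class=\"special_class\">Some\ncontent</p>\n<p>also keep</p>";

// /s lets . match newlines; .*? stops at the first closing </p>.
$result = preg_replace('%<p\s+class="special_class">.*?</p>%s', '', $subject);

echo $result, "\n";
```

Only the paragraph with the class is removed; the surrounding paragraphs and line breaks are kept.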
I find regex kind of confusing, so I got stuck with this problem:
I need to insert <b> tags around certain keywords in a given text. The problem is that if the keyword is within the href attribute, the result is a broken link.
the code goes like this:
$text = preg_replace('/(\b'.$keyword.'\b)/i','<b>\1</b>',$text);
so for cases like
this keyword here
i end up with:
this <b>keyword</b> here
i tried all sorts of combinations but i still couldn't get the right pattern.
thanks!
You can't do that with regex alone. Regexes are powerful, but they can't parse a recursive grammar like HTML.
Instead you should properly parse the HTML using an existing HTML parser. You just echo the HTML unchanged until you encounter a text node. In that case, you run your preg_replace on the text before echoing it.
If your HTML is valid XHTML, you can use the xml_parse function. If it's not, then use whatever HTML parser is available.
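The "only touch the text, pass the markup through" idea can be approximated without a full parser by splitting the input on tags. The sketch below is a simple tokenizer, not a real HTML parser: it assumes no ">" appears inside attribute values, and the function name bold_keyword_outside_tags is hypothetical.

```php
<?php
// Wrap $keyword in <b> tags, but only in text between tags, never inside a tag.
function bold_keyword_outside_tags(string $html, string $keyword): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even indexes are text, odd indexes are the captured tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b/i';

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace($pattern, '<b>$1</b>', $part);
        }
    }
    return implode('', $parts);
}

$out = bold_keyword_outside_tags('<a href="keyword.php">this keyword here</a>', 'keyword');
echo $out, "\n"; // <a href="keyword.php">this <b>keyword</b> here</a>
```

preg_quote keeps a keyword that contains regex metacharacters from breaking the pattern.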
You can use preg_replace again after the first replacement to remove b tags from href:
$text=preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i',"$1$2",$text);
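Put together, the two passes might look like the following sketch. The sample link is an assumption. Note that the second pass removes only one <b>...</b> pair per href on each run, so a URL containing the keyword several times would need the cleanup repeated.

```php
<?php
$keyword = 'keyword';
$text = '<a href="keyword.php">this keyword here</a>'; // hypothetical input

// Pass 1: bold every occurrence, including the one inside href (which breaks the link).
$text = preg_replace('/(\b' . preg_quote($keyword, '/') . '\b)/i', '<b>$1</b>', $text);

// Pass 2: strip the <b> tags that ended up inside the href attribute.
$text = preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i', '$1$2', $text);

echo $text, "\n"; // <a href="keyword.php">this <b>keyword</b> here</a>
```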
Yes, you can use regex like that, but the code might become a little convoluted.
Here is a quick example
$string = '<a href="keyword.php">link text with keyword and stuff</a>';
$keyword = 'keyword';
$text = preg_replace(
    '/(<a href=")('.$keyword.')(.php">)(.*)(<\/a>)/',
    "$1$2$3<b>$4</b>$5",
    $string
);
echo $string."\n";
echo $text."\n";
The contents of each () group are stored in $1, $2 ... $n, so I don't have to type things over again. The match can also be made more generic to match different kinds of URL syntax if needed.
Seeing this solution you might want to rethink the way you plan to do matching of keywords in your code. :)
output:
<a href="keyword.php">link text with keyword and stuff</a>
<a href="keyword.php"><b>link text with keyword and stuff</b></a>
WordPress spits posts in this format:
<h2>Some header</h>
<p>First paragraph of the post</p>
<p>Second paragraph of the post</p>
etc.
To get my cool styling on the first paragraph (it's one of those things that looks good only sparingly) I need to hook into the get_posts function to filter its output with a preg_replace.
The goal is to get the above code to look like:
<h2>Some header</h>
<p class="first">First paragraph of the post</p>
<p>Second paragraph of the post</p>
I have this so far but it's not even working (the error is: "preg_replace() [function.preg-replace]: Unknown modifier ']'")
$output=preg_replace('<p[^>]*>', '<p class="first">', $content);
I can't use CSS3 meta-selectors because I need to support IE6, and I can't apply the :first-line meta-selector (this is one that IE6 supports) on the parent container because it would hit the H2 instead of the first P.
You may find it easier and more reliable to use an HTML parser such as this one. HTML is notoriously difficult to parse reliably (technically, impossible) with regular expressions, and the parser will give you a very simple means to find the nodes you're interested in. The first page of the doc has a tab labelled "How to modify HTML elements".
Two right possibilities:
Do that in JavaScript. Using jQuery, for example, it's a one-liner: $("h2").next().addClass("first")
Use an HTML parser. Regexps are indeed not a good tool for what you want to do. Since loading a whole HTML parser for just this purpose is overkill, you'd really be better off using JavaScript.
The wrong way
Of course, in order to answer the question, here is the best way I can think of to make it happen with a regexp, though I don't recommend it.
preg_replace('#(</h2>\s*<p[^>]*)>#im', '$1 class="first">', '<h2>Some header</h2> <p>First paragraph of the post</p> <p>Second paragraph of the post</p> ');
What we do is:
using preg_replace so we can use advanced regexp to replace the code;
using the "i" flag so the regexp ignores case (the "m" flag only changes how ^ and $ behave, so it has no effect on this pattern; \s already matches line breaks);
using </h2>\s* to match the closing "h2" tag and all the spaces/line breaks after it;
using <p[^>]* to match the "p" tag and its current attributes;
using parentheses to save that;
using "$1" to replace the matched string with the part we saved;
adding the class and closing the ">".
The first drawback I can think of is that it doesn't handle the case where a class already exists.
Oh, and by the way, you have <h2>...</h> instead of <h2>...</h2>. I don't know if it's a typo but I assumed it was. Replace in the regexp accordingly if it's not.
The problem is that the first character of the pattern in a preg_* function is taken as the pattern delimiter, so the < is paired with the first > and everything after it is read as modifiers. What you'd need is something like:
$output = preg_replace('~<p\b([^>]*)>~', '<p class="first" \1>', $content, 1);
This also puts back any extra attributes the <p> may have.
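As a rough illustration of the delimiter point and of the limit argument, the sketch below runs the same pattern with two different delimiters on a made-up $content. The sample markup and variable names are assumptions; the replacement uses $1 directly after the class so that no stray space is added when the tag has no attributes.

```php
<?php
// Hypothetical post content with an existing attribute on the first paragraph.
$content = "<h2>Some header</h2>\n<p id=\"intro\">First paragraph</p>\n<p>Second paragraph</p>";

// Any matching pair of non-alphanumeric delimiters works; ~ and # avoid escaping "/".
// The fourth argument (1) limits the replacement to the first match only.
$a = preg_replace('~<p\b([^>]*)>~', '<p class="first"$1>', $content, 1);
$b = preg_replace('#<p\b([^>]*)>#', '<p class="first"$1>', $content, 1);

echo $a, "\n";
var_dump($a === $b); // bool(true)
```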
Overall, though, it's cleaner to do with CSS selectors and a JS fallback for IE.
EDIT: Added replacement limit and word break.
In this particular case a regexp solution would be fairly easy:
echo preg_replace('~</h2>\s*<p~', "$0 class='first'", $html);
Reading through the answers there are some that will work but all have drawbacks of either using an external parsing library or possibly matching tags other than the P tag or also matching its attributes.
I ended up using this solution with the str_replace_once function from here:
str_replace_once('<p>', '<p class="first">', $content);
Simple enough and it works just as intended. Here's full WordPress code snippet to filter the first paragraph any time the_content() is called:
add_filter('the_content', 'first_p_style');
function first_p_style($content) {
    // Add the class to the first <p> only; later paragraphs are left untouched.
    $output = str_replace_once('<p>', '<p class="first">', $content);
    return $output;
}
Thanks for all the answers!
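str_replace_once is not a PHP built-in, and the linked implementation is not shown here. A stand-in could look like the sketch below; the name replace_first is hypothetical and this is only one plausible way to write it.

```php
<?php
// Replace only the first occurrence of $search in $subject.
function replace_first(string $search, string $replace, string $subject): string
{
    $pos = strpos($subject, $search);
    if ($pos === false) {
        return $subject; // nothing to replace
    }
    return substr_replace($subject, $replace, $pos, strlen($search));
}

$content = "<h2>Some header</h2>\n<p>First paragraph</p>\n<p>Second paragraph</p>";
echo replace_first('<p>', '<p class="first">', $content), "\n";
```

Because it searches for the literal string '<p>', a first paragraph that already carries attributes would be skipped and the next plain <p> would get the class instead.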
I would like such empty span tags (filled with &nbsp; and spaces) to be removed:
<span>&nbsp; </span>
I've tried with this regex, but it needs adjusting:
(<span>(&nbsp;|\s)*</span>)
preg_replace('#<span>(&nbsp;|\s)*</span>#si','<\\1>',$encoded);
Translating Kent Fredric's regexp to PHP:
preg_match_all('#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#im', $html, $result);
This will match:
autoclosing spans
spans on multilines and whatever the case
spans with attributes
span with unbreakable spaces
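A quick way to check those cases is to run the pattern over a sample string that contains one of each. The $html below is made up for this sketch, and the non-breaking space is assumed to appear as the literal entity &nbsp; in the markup.

```php
<?php
// Hypothetical input: self-closing span, multi-line upper-case span with an attribute,
// span holding only &nbsp;, and a span with real text that must not match.
$html = '<span/> <SPAN class="x">' . "\n" . '</SPAN> <span>&nbsp;</span> <span>text</span>';

preg_match_all('#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#i', $html, $result);

print_r($result[0]);
echo count($result[0]), "\n"; // 3
```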
Maybe you should think about including spans containing only <br /> as well...
As usual, when it comes to tweaking regexps, some tools are handy:
http://regex.larsolavtorvik.com/
qr{<span[^>]*(/>|>\s*?</span>)}
Should get the gist of them. (Including XML-style self-closing tags, i.e. <span />)
But you really shouldn't use regex for HTML processing.
Answer only relevant to the context of the question that was visible before the formatting errors were corrected
I suppose these span are generated by some program, since they don't seem to have any attribute.
I am perplexed as to why you need to put the space they enclose between angle brackets, but then again I don't know the final purpose of the code.
I think the solution is given by Kent: you have to make the match non-greedy: since you use dotall option (s), you will match everything between the first span and the last closing span!
So the answer should look like:
preg_replace('#<span>(&nbsp;|\s)*?</span>#si', '<$1>', $encoded);
(untested)
I've tried with this regex, but it needs adjusting:
In what way does the regex in the original question fail?
The problem comes when the span gets
nested like: <span><span> </span></span>
This is an example of why using regexes to parse HTML doesn't work particularly well. Depending on your regex flavor, this situation is either impossible to handle in a single pass or merely very difficult. I don't know PHP's regex engine well enough to say which category it falls into, but, if the only problem is that it takes out the inner <span> and leaves the outer one alone, then you may want to consider simply re-running your substitution repeatedly until it runs out of things to do.
If your only issue is nested span tags, you can run the search-and-replace with the regex you have in a loop until the regex no longer finds any matches.
This may not be a very elegant solution, but it'll perform well enough.
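Such a loop could be sketched as follows, using the optional count argument of preg_replace to detect when nothing was replaced. The function name remove_empty_spans_nested and the sample input are assumptions for illustration.

```php
<?php
// Repeatedly strip empty spans until a pass makes no replacement,
// so that spans left empty by an earlier pass are removed too.
function remove_empty_spans_nested(string $html): string
{
    $pattern = '#<span[^>]*>(?:\s|&nbsp;)*</span>#i';
    do {
        $html = preg_replace($pattern, '', $html, -1, $count);
    } while ($count > 0);
    return $html;
}

$out = remove_empty_spans_nested('<p>a<span><span>&nbsp; </span></span>b</p>');
echo $out, "\n"; // <p>ab</p>
```

The first pass removes the inner span, the second removes the now-empty outer one, and the third finds nothing and ends the loop.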
Here is my solution to nesting tags problems, still not complete but close...
$test = "<span> <span>&nbsp; </span> test <span>&nbsp; <span>&nbsp; </span> </span> &nbsp;&nbsp; </span>";
$pattern = '#<(\w+)[^>]*>(&nbsp;|\s)*</\1>#im';
// Keep replacing until no empty element is left.
while (preg_match($pattern, $test) != 0) {
    $test = preg_replace($pattern, '', $test);
}
For short $test sentences the function works OK. Problem comes when trying with a long text. Any help will be appreciated...
Modifying e-satis' answer a bit:
function remove_empty_spans($html_replace)
{
    // Matches self-closing spans and spans containing only whitespace or &nbsp;.
    $pattern = '/<span[^>]*(?:\/>|>(?:\s|&nbsp;)*<\/span>)/im';
    return preg_replace($pattern, '', $html_replace);
}
This worked for me.