I'm having issues with my preg_match solution.
I have the following html code:
<h1> Text marking test</h1><b> Chicago</b> - This is the text. Can this problem be solved by you?
I also have almost similar content:
Chicago - This is the text. Can this issue be solved by you?
All runs of multiple spaces are gone, and "problem" has turned into "issue".
I want to mark:
Chicago - This is the text. Can this
be solved by you?
So that I get this:
<h1> Text marking test</h1><div class="marked"><b> Chicago</b> - This is the text. Can this</div> problem <div class="marked">be solved by you?</div>
I have the following regular expression pattern which works:
$string = preg_replace("/(?im)(<b>)*Chicago([\s,.!?:;'\"]|<([^>]+)>)*-([\s,.!?:;'\"]|<([^>]+)>)*This([\s,.!?:;'\"]|<([^>]+)>)*is([\s,.!?:;'\"]|<([^>]+)>)*the([\s,.!?:;'\"]|<([^>]+)>)*text([\s,.!?:;'\"]|<([^>]+)>)*Can([\s,.!?:;'\"]|<([^>]+)>)*this([\s,.!?:;'\"]|<([^>]+)>)*/", '<div class="marked">' . '${0}' . '</div>', $string);
The problem is that the <b> tag directly in front of Chicago could be any tag with any attributes, and it is also optional.
It may only be the tag directly in front of Chicago, not any tag further back.
But somehow I constantly fail in my attempts.
Any help is greatly appreciated!
Maybe you can remove all the HTML tags before the text analysis by replacing every match of "<[^>]*>" with an empty string (a replace-all, e.g. preg_replace in PHP), and then use a simpler regex for the text analysis.
I suggest using several small regexes instead of one big one; it makes it easier to locate a bug or update your program.
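As a rough sketch of the "small pieces" idea applied to the pattern in the question: instead of typing the long regex by hand, it can be assembled from the plain text, with one optional opening tag allowed directly before the first word. The helper names build_mark_pattern and mark_phrase are made up for this illustration, and the separator class is simply copied from the question.

```php
<?php
// Hypothetical helper: turns a plain-text phrase into a tag-tolerant pattern.
function build_mark_pattern(string $phrase): string
{
    // Between two words allow whitespace, punctuation, or any HTML tag.
    $gap = '(?:[\s,.!?:;\'"]|<[^>]+>)*';

    // Split the phrase on the same whitespace/punctuation characters.
    $words = preg_split('/[\s,.!?:;\'"]+/', $phrase, -1, PREG_SPLIT_NO_EMPTY);
    $words = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);

    // One optional opening tag (any name, any attributes) right before the first word.
    return '/(?:<[a-z][^>]*>\s*)?' . implode($gap, $words) . '/i';
}

// Hypothetical wrapper: wraps the first-to-last matched word in the marker div.
function mark_phrase(string $html, string $phrase): string
{
    return preg_replace(build_mark_pattern($phrase), '<div class="marked">$0</div>', $html);
}
```

Because the optional tag pattern starts with a letter after "<", a closing tag such as </h1> in front of the phrase is never pulled into the marked block.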
Edit: I had misread your question and deleted my answer, but upon reading it again I think it might offer you some pointers on how to proceed. I don't completely understand the question, so please pardon the unsatisfactory answer.
You want to strip the text of HTML tags, as well as of multiple spaces. I would tackle these things separately:
function clean_text($text) {
    $text = strip_tags($text);
    $text = preg_replace('/\s{2,}/', ' ', $text);
    return $text;
}
Use built-in functions where possible – no sense in re-inventing the wheel, especially as usually a lot of thought went into the functions. As for the second part, we match two or more whitespace characters and replace them with one space only.
I am programmatically cleaning up some basic grammar in comments and other user submitted content. Capitalizing I, the first letter of sentence, etc. The comments and content are mixed with HTML as users have some options in formatting their text.
This is actually proving to be a bit more challenging than expected, especially to someone new to PHP and regex.
Is there a function like ucfirst that will ignore HTML to help capitalize sentences?
Also, any links or tutorials on cleaning up text like this in HTML would be appreciated. Please leave anything you feel would help in the comments. Thanks!
EDIT:
Sample Text:
<div><p>i wuz walkin thru the PaRK and found <strong>ur dog</strong>. <br />i hoPe to get a reward.<br /> plz call or text 7zero4 8two8 49 sevenseven</div>
I need for it to be (ultimately)
<div><p>I was walking through the park and found <strong>your dog</strong>. <p>I hope to get a reward.</p><p> Please call or text (704) 828-4977.</p>
I know this is going a little farther than the intended question, but my thought was to do this incrementally. ucfirst() is just one of many functions I was using to do one small cleanup at a time per scan. Even if I had to run the text 100 times through the filter, this runs on a cron run when the site has no traffic. I wish there was a discussion forum where this could continue as obviously there would be some great ideas on continuing the approach. Any thoughts on how to approach this as an overall project by all means please leave a comment.
I guess, in the spirit of the question itself, ucfirst would then not be the best function for this, as it cannot take a list of things to ignore. A flag IGNORE_HTML would be great!
Given this is a PHP question, then the DOM parser recommended below sounds like the best answer? Thoughts?
You can also add a CSS pseudo-element to your desired elements like this:
div:first-letter {
    text-transform: uppercase;
}
But you will probably need to change the way you print out your sentences (if you are printing them all in one huge tag), since CSS lacks the ability to detect the start of a new sentence inside a single tag :(
You should probably use a DOM parser (either the built-in one or for example this one, which is really easy to use).
Walk through all of the text nodes in your HTML and perform the clean-up with preg_replace_callback, ucfirst and a regular expression like this one:
'/(\s*)([^.?!]*)/'
This will match a string of whitespace, and then as many non-sentence-ending-punctuation characters as possible. The actual sentence (starting with a letter, unless your sentence starts with ", which complicates things a bit) will then be found in the second capturing group; the first one only holds the leading whitespace.
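A minimal sketch of how that pattern could be combined with preg_replace_callback and ucfirst on a single piece of plain text (a text node, not markup). The function name capitalize_sentences is only an assumption for this example.

```php
<?php
// Hypothetical helper: capitalizes the first character of each sentence
// in a plain-text string (no HTML expected in $text).
function capitalize_sentences(string $text): string
{
    return preg_replace_callback('/(\s*)([^.?!]*)/', function ($m) {
        // $m[1] is the leading whitespace, $m[2] the sentence itself.
        return $m[1] . ucfirst($m[2]);
    }, $text);
}

echo capitalize_sentences('i walked. it rained! ok?'); // I walked. It rained! Ok?
```

Sentences that start with a quote or another non-letter are left alone, since ucfirst only touches the very first character.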
But from your question, I suppose you are already doing something like the latter and your code is just choking on HTML tags. Here is some example code to get all text nodes with the second DOM parser I linked:
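require 'simple_html_dom.php';

$html = new simple_html_dom();
$html->load($fullHtmlStr);
// Assigning to the loop variable would not change the document,
// so write the cleaned text back into each text node instead.
foreach ($html->find('text') as $textNode) {
    $textNode->innertext = cleanupFunction($textNode->innertext);
}
$cleanedHtmlStr = $html->save();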
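For comparison, here is a sketch of the same walk over text nodes with the built-in DOM extension. It assumes the dom extension is enabled, that the fragment has a single root element, and that the input is ASCII or otherwise safe for loadHTML's default encoding handling; map_text_nodes is a made-up name.

```php
<?php
// Hypothetical helper: runs $callback over every text node, leaving tags alone.
function map_text_nodes(string $html, callable $callback): string
{
    $doc = new DOMDocument();
    // The flags keep libxml from adding <html>/<body> wrappers and a doctype.
    // Warnings about sloppy markup are silenced with @ for brevity.
    @$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $textNode) {
        $textNode->nodeValue = $callback($textNode->nodeValue);
    }
    return trim($doc->saveHTML());
}

// strtoupper stands in for a real cleanup function here.
echo map_text_nodes('<div><p>found <strong>ur dog</strong>.</p></div>', 'strtoupper');
```

Note that a sentence can be split over several text nodes (as with the <strong> above), so a per-node cleanup cannot know on its own whether a node starts a new sentence.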
In HTML this will be very difficult to do, as you would be building some kind of HTML parser. My suggestion would be to clean up the text before it is transformed into HTML, at the moment you pull it out of the database. Or even better, clean up the database once.
This should do it:
function html_ucfirst($s) {
    return preg_replace_callback('#^((<(.+?)>)*)(.*?)$#', function ($c) {
        return $c[1].ucfirst(array_pop($c));
    }, $s);
}
Converts
<b>foo</b> to <b>Foo</b>,
<div><p>test</p></div> to <div><p>Test</p></div>,
but also bar to Bar.
Edit: According to your detailed question, you probably want to apply this function to each sentence. You will have to parse the text first (e.g. splitting by periods).
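One possible way to do that per-sentence step, as a naive sketch: split after ".", "!" or "?" followed by whitespace, run html_ucfirst on each piece, and glue everything back together. The wrapper name html_ucfirst_sentences is an assumption, and the split will misfire on abbreviations or on punctuation inside attribute values.

```php
<?php
// html_ucfirst() as given in the answer above.
function html_ucfirst($s) {
    return preg_replace_callback('#^((<(.+?)>)*)(.*?)$#', function ($c) {
        return $c[1] . ucfirst(array_pop($c));
    }, $s);
}

// Hypothetical wrapper: applies html_ucfirst() to each sentence.
function html_ucfirst_sentences(string $html): string
{
    // DELIM_CAPTURE keeps the whitespace, so even indexes are sentences
    // and odd indexes are the separators between them.
    $parts = preg_split('/(?<=[.!?])(\s+)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = html_ucfirst($part);
        }
    }
    return implode('', $parts);
}

echo html_ucfirst_sentences('<b>foo</b> bar. <i>baz</i> qux! end');
// <b>Foo</b> bar. <i>Baz</i> qux! End
```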
I have lots of text marked up like this:
<span class="section">[Section]</span>
I need to remove everything that has class="section", including the span tags and the text inside them. I'm looking for a regex or an alternative to automate this task.
Any clues?
Edit: I'm up for anything that helps me solve this; I thought regex was the easier way. I'm coding in PHP.
Thanks.
If your section-class tags don't contain elements of the same type (e.g. you do not have spans containing spans) you can do this quite easily with a regex.
The following is the simplest:
$stripped = preg_replace('#<span class="section">.*?</span>#', '', $input);
This, if you need it, allows for any tag, any other attributes, and any other classes:
$stripped = preg_replace('#<(\w+)[^>]*class="[^"]*section[^"]*"[^>]*>.*?</\1>#', '', $input);
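To see the difference between the two patterns, here is a small comparison on a made-up input string; the wrapper function names are only for illustration.

```php
<?php
// Both patterns are the ones from the answer above, wrapped for comparison.
function strip_section_spans(string $input): string
{
    return preg_replace('#<span class="section">.*?</span>#', '', $input);
}

function strip_section_elements(string $input): string
{
    return preg_replace('#<(\w+)[^>]*class="[^"]*section[^"]*"[^>]*>.*?</\1>#', '', $input);
}

$input = 'A<span class="section">[One]</span>B<div id="x" class="big section">[Two]</div>C';
echo strip_section_spans($input), "\n";    // AB<div id="x" class="big section">[Two]</div>C
echo strip_section_elements($input), "\n"; // ABC
```

Two things to keep in mind: without the s modifier the dot does not match newlines, so elements spanning several lines are left alone, and the second pattern also catches class names that merely contain "section", such as "subsection".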
I have this code to do some ugly inline text style color formatting on html content.
But this breaks anything inside tags, such as links and emails.
I've halfway figured out how to prevent it from formatting links, but it still doesn't prevent the replacement when the text is info#mytext.com
$text = preg_replace('/(?<!\.)mytext(?!\/)/', '<span style="color:#DD1E32">my</span><span style="color:#002d6a">text</span>', $text);
What would be a better approach to replace only text and avoid replacing inside links?
A better approach would be to use XML functions instead.
Your lookbehind assertion only tests one character, so it's insufficient to assert that a match lies outside of HTML tags. This is a case where regular expressions aren't the best option. You can, however, get an approximation like:
$text = preg_replace("/(>[^<]*)(?<![#.])(mytext)/", "$1<span>$2</span>", $text);
This would overlook the first occurrence of mytext if it's not preceded by an HTML tag. So it works best if $text = "<div>$text</div>" or something similar.
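Another way to keep the replacement out of tags, sketched here under the assumption that no attribute value contains a ">" character: split the string into tags and text with preg_split, and only run the replacement on the text pieces. The function name colorize_outside_tags is made up, and the lookarounds are the ones from the question plus "#" for the email case.

```php
<?php
// Hypothetical helper: colors "mytext" in text only, never inside a tag.
function colorize_outside_tags(string $html): string
{
    // With DELIM_CAPTURE the result alternates: text, tag, text, tag, ...
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $colored = '<span style="color:#DD1E32">my</span><span style="color:#002d6a">text</span>';

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // odd indexes are the tags themselves
        }
        // Skip occurrences preceded by "." or "#" or followed by "/".
        $parts[$i] = preg_replace('/(?<![.#])mytext(?!\/)/', $colored, $part);
    }
    return implode('', $parts);
}

echo colorize_outside_tags('<a href="http://mytext.com/">Visit mytext</a> or mail info#mytext.com');
```

The href and the email stay untouched, while the link text between <a> and </a> is still colored; excluding link text as well would need real parsing rather than a split.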
[edited] Oh, I see you solved the href problem.
To solve your email problem, change the email address to [email_safeguard] with str_replace before working on the text, and when you're finished, change it back. :)
$text = str_replace('info#mytext.com','[email_safeguard]',$text);
// work on the text with preg_replace()
$text = str_replace('[email_safeguard]','info#mytext.com',$text);
That should do the trick :)
But as people have mentioned before, you'd better avoid mixing HTML and regex, or you will suffer the wrath of Cthulhu.
see this instead
I am using php to output some rich text. How can I strip out the inline styles completely?
The text will be pasted straight out of MS Word or OpenOffice into a textarea which uses TinyMCE, a rich-text editor that allows you to add basic HTML formatting to the text.
However I want to remove the inline styles on the tags (see below), but preserve the tags themselves.
<p style="margin-bottom: 0cm;">A patrol of Zograth apes came round the corner, causing Rosette to pull Rufus into a small alcove, where she pressed her body against his. “Sorry.” She said, breathing warm air onto the shy man's neck. Rufus trembled.</p>
<p style="margin-bottom: 0cm;"> </p>
<p style="margin-bottom: 0cm;">Rosette checked the coast was clear and pulled Rufus out of their hidey hole. They watched as the Zograth walked down a corridor, almost out of sight and then collapsed next to a phallic fountain. As their bodies hit the ground, their guns clattered across the floor. Rosette stopped one with her heel and picked it up immediately, tossing the other one to Rufus. “Most of these apes seem to be dying, but you might need this, just to give them a helping hand.”</p>
I quickly put this together, but for 'inline styles' (!) you will need something like
$text = preg_replace('#(<[a-z ]*)(style=("|\')(.*?)("|\'))([a-z ]*>)#', '\\1\\6', $text);
Here is a preg_replace solution I derived from Crozin's answer. This one allows for attributes before and after the style attribute fixing the issue with anchor tags.
$value = preg_replace('/(<[^>]*) style=("[^"]+"|\'[^\']+\')([^>]*>)/i', '$1$3', $value);
Use HtmlPurifier
You could use regular expressions:
$text = preg_replace('#<(.+?)style=(?:"|\')?[^"\']+(?:"|\')?(.*?)>#si', '<\\1\\2>', $text);
You can use: $content = preg_replace('/style=[^>]*/', '', $content);
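A quick comparison shows why the short pattern above needs care: it removes everything from style= up to the closing ">", so attributes that come after style are lost, whereas the stricter pattern from the earlier answer only drops the quoted style attribute. The function names and the sample tag are assumptions for this demonstration.

```php
<?php
// Short pattern from the line above: everything from style= up to ">" is dropped.
function strip_style_greedy(string $html): string
{
    return preg_replace('/style=[^>]*/', '', $html);
}

// Stricter pattern from the earlier answer: only the quoted style attribute is dropped.
function strip_style_only(string $html): string
{
    return preg_replace('/(<[^>]*) style=("[^"]+"|\'[^\']+\')([^>]*>)/i', '$1$3', $html);
}

$html = '<a style="color: red;" href="page.html">link</a>';
echo strip_style_greedy($html), "\n"; // <a >link</a>
echo strip_style_only($html), "\n";   // <a href="page.html">link</a>
```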
You can also use PHP Simple HTML DOM Parser, as follows:
$html = str_get_html(SOME_HTML_STRING);
foreach ($html->find('*[style]') as $item) {
    $item->style = null;
}
Couldn't you just use strip_tags and leave in the tags you want eg <p>, <strong> etc?
Why don't you just overwrite the tags. So you will have clean tags without inline styling.
I found this class very useful for doing strip attributes (especially where there's crazy MS Word formatting all through the text):
http://semlabs.co.uk/journal/php-strip-attributes-class-for-xml-and-html
I needed to clear the style attribute from img tags and solved it with this code:
$text = preg_replace('#(<img (.*) style=("|\')(.*?)("|\'))([a-z ]*)#', '<img \\2\\6', $text);
echo $text;
I would like such empty span tags (filled with &nbsp; and spaces) to be removed:
<span>&nbsp;</span>
I've tried with this regex, but it needs adjusting:
(<span>(&nbsp;|\s)*</span>)
preg_replace('#<span>(&nbsp;|\s)*</span>#si','<\\1>',$encoded);
Translating Kent Fredric's regexp to PHP :
preg_match_all('#<span[^>]*(?:/>|>(?:\s|&nbsp;)*</span>)#im', $html, $result);
This will match :
self-closing spans
spans spread over several lines, in any letter case
spans with attributes
spans containing non-breaking spaces
Maybe you should think about including spans containing only <br /> as well...
As usual, when it comes to tweaking regexps, some tools are handy:
http://regex.larsolavtorvik.com/
qr{<span[^>]*(/>|>\s*?</span>)}
That should get the gist of them (including XML-style self-closing tags, i.e. <span />).
But you really shouldn't use regex for HTML processing.
Answer only relevant to the context of the question that was visible before the formatting errors were corrected
I suppose these spans are generated by some program, since they don't seem to have any attributes.
I am perplexed as to why you need to put the space they enclose between angle brackets, but then again I don't know the final purpose of the code.
I think the solution is given by Kent: you have to make the match non-greedy. Since you use the dotall option (s), you will match everything between the first span and the last closing span!
So the answer should look like:
preg_replace('#<span>(&nbsp;|\s)*?</span>#si', '<$1>', $encoded);
(untested)
I've tried with this regex, but it needs adjusting:
In what way does the regex in the original question fail?
The problem comes when the spans get nested, like: <span><span>&nbsp;</span></span>
This is an example of why using regexes to parse HTML doesn't work particularly well. Depending on your regex flavor, this situation is either impossible to handle in a single pass or merely very difficult. I don't know PHP's regex engine well enough to say which category it falls into, but, if the only problem is that it takes out the inner <span> and leaves the outer one alone, then you may want to consider simply re-running your substitution repeatedly until it runs out of things to do.
If your only issue is nested span tags, you can run the search-and-replace with the regex you have in a loop until the regex no longer finds any matches.
This may not be a very elegant solution, but it'll perform well enough.
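A minimal sketch of that loop, using the count that preg_replace reports through its fifth argument to decide when to stop. The function name and the exact pattern (any span containing only whitespace or &nbsp; entities) are assumptions for this example.

```php
<?php
// Hypothetical helper: removes empty spans, including nested ones,
// by repeating the replacement until nothing is left to remove.
function remove_empty_spans_nested(string $html): string
{
    $pattern = '#<span[^>]*>(?:\s|&nbsp;)*</span>#i';
    do {
        // $count receives the number of replacements made in this pass.
        $html = preg_replace($pattern, '', $html, -1, $count);
    } while ($count > 0);
    return $html;
}

echo remove_empty_spans_nested('<p>a<span><span>&nbsp; </span></span>b</p>'); // <p>ab</p>
```

Each pass removes the innermost empty spans, which turns their parents into empty spans for the next pass, so the number of passes equals the nesting depth plus one final pass that finds nothing.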
Here is my solution to nesting tags problems, still not complete but close...
$test = "<span> <span>&nbsp; </span> test <span>&nbsp; <span>&nbsp; </span> </span> &nbsp;&nbsp; </span>";
$pattern = '#<(\w+)[^>]*>(&nbsp;|\s)*</\1>#im';
// Repeat until no empty element is left, so that nested ones are removed too.
while (preg_match($pattern, $test, $matches, PREG_OFFSET_CAPTURE) != 0) {
    $test = preg_replace($pattern, '', $test);
}
For short $test strings the code works OK. The problem comes when trying it with a long text. Any help will be appreciated...
Modifying e-satis' answer a bit:
function remove_empty_spans($html_replace)
{
    $pattern = '/<span[^>]*(?:\/>|>(?:\s|&nbsp;)*<\/span>)/im';
    return preg_replace($pattern, '', $html_replace);
}
This worked for me.