Process text in HTML and reinsert it into the HTML structure - PHP

I want to grab the text from some HTML, process and change it, and reinsert it into that HTML with PHP.
<p>This is my sentence <span>and more</span> also <strong>important</strong> part.</p>
What's the best method? Using preg_*? How can I reinsert my text into the HTML structure?
For example, I want to collapse every run of two or more spaces between words.
preg_replace('/\s+/', ' ', $myText);
But I want this applied only to the text of my HTML, not to the HTML tags, attributes, etc.

Have a look at DOMDocument. It'll allow you to do some manipulation on your HTML.
http://www.php.net/manual/en/domdocument.loadhtml.php
EDIT
If you want to elaborate on exactly what you want to do with your HTML example, we might be able to provide a more specific answer :)
EDIT
To reflect the updated question: multiple spaces in HTML collapse when rendered anyway, but if you want to remove them then you could try the following:
$result = preg_replace_callback('/(?<=>)[\w\s]+(?=<)/', function ($match) {
    // preg_replace returns the fragment unchanged when nothing matches;
    // preg_filter would return null here and delete the text
    return preg_replace('/\s+/', ' ', $match[0]);
}, $str);
I'm not a regex expert by any stretch, so I'm sure there's a more elegant way to do this, but this might work for you nonetheless: first call preg_replace_callback with lookarounds to grab any text fragments between the end of one tag and the start of the next. Then pass each fragment through preg_replace to collapse runs of whitespace into a single space. Note that [\w\s]+ only matches fragments made up of word characters and whitespace, so a fragment containing punctuation is left as-is.
Hope this helps/works :)
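To make the DOMDocument suggestion concrete, here is a minimal sketch that collapses whitespace in text nodes only and leaves tags and attribute values alone. The function name collapseTextWhitespace and the wrapper div with id "wrap" are made up for illustration, and the sketch assumes the input is a UTF-8 HTML fragment rather than a full page.

```php
<?php
// Hypothetical helper: collapse whitespace runs inside text nodes only.
function collapseTextWhitespace(string $html): string
{
    $doc = new DOMDocument();
    // Silence warnings about imperfect markup while loading.
    libxml_use_internal_errors(true);
    // The XML encoding hint makes libxml treat the input as UTF-8;
    // the wrapper div lets us pull the fragment back out afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // text() selects text nodes only, so attributes are never touched.
    foreach ($xpath->query('//text()') as $node) {
        $node->nodeValue = preg_replace('/\s+/', ' ', $node->nodeValue);
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    $wrap = $xpath->query('//div[@id="wrap"]')->item(0);
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Unlike the lookaround regex, this also handles text fragments that contain punctuation, at the cost of re-serializing the markup.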

Related

Get entire HTML, not just text with Goutte

I'm parsing a website and I have a problem: some of its text is split up with <br>, but when I use $node->text(), there's not even a space in place of the <br>.
How can I get the <br> too, or at least replace it with a space?
The HTML is something like this:
<span>Some<br>Text</span>
Currently I get SomeText and I want it to be Some Text.
Thanks!
With Goutte you can use the html() method.
$node->html();
It will include the <br> though. You could then use strip_tags to remove the HTML tags.
$text = strip_tags($node->html());
There is probably a built-in way of doing this with Goutte.
You can retrieve the HTML for that node instead of the text, and replace the <br> tags with spaces yourself. Something like this should do just fine:
str_replace('<br>', ' ', strip_tags($node->html(), '<br>'));
The strip_tags is there to remove anything that's not <br>, so it would be the equivalent of the text() method, but allow the line break tags. Then they can be replaced with spaces using str_replace. The above will transform this:
<span>Some<br>Text</span>
into this
Some Text
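The one-liner above only matches the exact spelling <br>. As a hedged variation, the sketch below also covers <br/> and <br />. It works on a plain HTML string (for example, whatever $node->html() returned), and the function name textWithBreaksAsSpaces is invented for this example.

```php
<?php
// Hypothetical helper: strip all tags except line breaks, then turn
// every spelling of the break tag into a single space.
function textWithBreaksAsSpaces(string $html): string
{
    // Allowing '<br>' in strip_tags keeps <br>, <br/> and <br /> alike.
    $onlyBreaks = strip_tags($html, '<br>');
    // Case-insensitive match for <br>, <br/> and <br />.
    return preg_replace('#<br\s*/?>#i', ' ', $onlyBreaks);
}

echo textWithBreaksAsSpaces('<span>Some<br>Text</span>'), "\n";
```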

How to prevent PHP str_replace replace text in html tags?

I am developing a website where users can upload their texts. For management purposes, I want to
change every occurrence of the text "apple" to <a href="https://apple.com">apple</a> dynamically with PHP.
I am currently using str_replace('apple', '<a href="https://apple.com">apple</a>', $text).
However, the word "apple" might already have been linked to an external source by a user. In that case, the replacement will mess up the original link.
Say the page has the following:
<a href="...">apple</a>
my code will change it to
<a href="..."><a href="https://apple.com">apple</a></a>
Is there any way I can tell whether a given "apple" is already inside an <a> tag or another HTML tag?
Thank you
Use DOMDocument to turn the HTML into a DOM you can work with. Then, iterate over all text nodes, making the replacements.
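A rough sketch of that idea follows, under a few assumptions: the input is a UTF-8 HTML fragment, the function name linkWord and the wrapper div are made up for illustration, and "already linked" is taken to mean "has an <a> ancestor".

```php
<?php
// Hypothetical helper: wrap $word in a link, but only inside text nodes
// that are not already part of an <a> element.
function linkWord(string $html, string $word, string $url): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array first, because the tree is modified below.
    $nodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));
    $pattern = '/(' . preg_quote($word, '/') . ')/';

    foreach ($nodes as $text) {
        // With DELIM_CAPTURE, odd indices hold the matched word.
        $parts = preg_split($pattern, $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // word not present in this text node
        }
        $frag = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $url);
                $a->appendChild($doc->createTextNode($part));
                $frag->appendChild($a);
            } elseif ($part !== '') {
                $frag->appendChild($doc->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($frag, $text);
    }

    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the replacement happens on text nodes, tag names and attribute values that happen to contain the word are never rewritten.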
Why not use an if statement to look for the <a href="..">, else do your replacement?
Would all occurrences of "Apple" be in regular sentences (i.e. preceded or followed by spaces or newlines)? If so, you could try something like this:
str_replace(' apple', ' <a href="https://apple.com">apple</a>', $string);
If that won't do what you need, do a catch-all str_replace and then use preg_match with a regex to clean up any nested links. Something along the lines of this, which would preserve the original link (though I don't recommend using regex to parse HTML).
preg_match('/\\\3', $string);

Strip html tags and content with specified attributes

I have lots of text marked up like this:
<span class="section">[Section]</span>
I need to remove everything that has class="section", including the span tags and the text inside them. I'm looking for a regex or an alternative to automate this task.
Any clues?
Edit: I'm open to anything that helps me solve this; I thought regex was the easiest way. I'm coding in PHP.
Thanks.
If your section-class tags don't contain elements of the same type (e.g. you do not have spans containing spans) you can do this quite easily with a regex.
The following is the simplest:
$stripped = preg_replace('#<span class="section">.*?</span>#', '', $input);
This, if you need it, allows for any tag, any other attributes, and any other classes:
$stripped = preg_replace('#<(\w+)[^>]*class="[^"]*section[^"]*"[^>]*>.*?</\1>#', '', $input);
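If the marked-up elements can nest, a DOM-based alternative avoids the limitation mentioned above. This is only a sketch: the function name removeByClass and the wrapper div are invented here, the input is assumed to be a UTF-8 HTML fragment, and $class is assumed to be a simple class name without quotes.

```php
<?php
// Hypothetical helper: remove every element carrying the given class,
// together with its contents.
function removeByClass(string $html, string $class): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Pad with spaces so "section" matches as a whole class token,
    // not as part of "subsection".
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    foreach (iterator_to_array($xpath->query($query)) as $el) {
        if ($el->parentNode !== null) {
            $el->parentNode->removeChild($el);
        }
    }

    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```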

preg_replace on text only and not inside href's

I have this code to do some ugly inline color formatting on HTML content.
But it breaks anything inside tags, such as links and email addresses.
I've figured out halfway how to prevent it from formatting links, but it still doesn't prevent the replacement when the text is info#mytext.com
$text = preg_replace('/(?<!\.)mytext(?!\/)/', '<span style="color:#DD1E32">my</span><span style="color:#002d6a">text</span>', $text);
What would be a better approach to replace only the text and leave links untouched?
A better approach would be to use XML functions instead.
Your lookbehind assertion only tests one character, so it's insufficient to assert that a match lies outside of HTML tags. This is a case where regular expressions aren't the best option. You can, however, get an approximation like:
$text = preg_replace("/(>[^<]*)(?<![#.])(mytext)/", "$1<span>$2</span>", $text);
This would overlook the first occurrence of mytext if it's not preceded by an HTML tag, so it works best if you first wrap the input, e.g. $text = "<div>$text</div>".
[edited] Oh, I see you solved the href problem.
To solve your email problem, use str_replace to change info#mytext.com to [email_safeguard] before working on the text, and change it back when you're finished. :)
$text = str_replace('info#mytext.com','[email_safeguard]',$text);
// work on the text with preg_replace()
$text = str_replace('[email_safeguard]','info#mytext.com',$text);
That should do the trick :)
But as people have mentioned before, you'd better avoid mixing HTML and regex, or you will suffer the wrath of Cthulhu.
see this instead

php - preg_match string not within the href attribute

I find regex kind of confusing, so I got stuck with this problem:
I need to insert <b> tags around certain keywords in a given text. The problem is that if the keyword is within an href attribute, the result is a broken link.
The code goes like this:
$text = preg_replace('/(\b'.$keyword.'\b)/i','<b>\1</b>',$text);
so for cases like
this keyword here
i end up with:
this <b>keyword</b> here
I tried all sorts of combinations but I still couldn't get the right pattern.
Thanks!
You can't do that with regex alone. Regular expressions are powerful, but they can't parse a recursive grammar like HTML.
Instead, you should properly parse the HTML using an existing HTML parser. You just echo the HTML as-is until you encounter a text node. In that case, you run your preg_replace on the text before echoing it.
If your HTML is valid XHTML, you can use the xml_parse function. If it's not, then use whatever HTML parser is available.
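As a hedged sketch of the xml_parse route: the handlers below re-emit tags unchanged and run the keyword replacement on character data only. It assumes well-formed XHTML with a single root element and the xml extension being available. The function name boldKeywordXhtml is made up, empty elements such as <br/> come back as <br></br>, and comments or processing instructions are dropped.

```php
<?php
// Hypothetical helper: bold $keyword in text, never in tags or attributes.
function boldKeywordXhtml(string $xhtml, string $keyword): string
{
    $out = '';
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b/i';

    $parser = xml_parser_create('UTF-8');
    // Keep tag and attribute names in their original case.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Start tag: echo it back with its attributes re-escaped.
        function ($p, $name, $attrs) use (&$out) {
            $out .= '<' . $name;
            foreach ($attrs as $key => $value) {
                $out .= ' ' . $key . '="' . htmlspecialchars($value, ENT_QUOTES) . '"';
            }
            $out .= '>';
        },
        // End tag: echo it back unchanged.
        function ($p, $name) use (&$out) {
            $out .= '</' . $name . '>';
        }
    );

    // Text: escape it again first, then wrap the keyword in <b> tags.
    xml_set_character_data_handler($parser, function ($p, $data) use (&$out, $pattern) {
        $out .= preg_replace($pattern, '<b>$1</b>', htmlspecialchars($data, ENT_NOQUOTES));
    });

    // On a parse error, fall back to returning the input untouched.
    return xml_parse($parser, $xhtml, true) ? $out : $xhtml;
}
```

For HTML that is not well-formed, the same "tags pass through, text gets processed" idea applies with a forgiving parser such as DOMDocument.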
You can use preg_replace again after the first replacement to remove b tags from href:
$text=preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i',"$1$2",$text);
Yes, you can use regex like that, but the code might become a little convoluted.
Here is a quick example:
$string = '<a href="keyword.php">link text with keyword and stuff</a>';
$keyword = 'keyword';
$text = preg_replace(
    '/(<a href=")('.$keyword.')(\.php">)(.*)(<\/a>)/',
    "$1$2$3<b>$4</b>$5",
    $string
);
echo $string."\n";
echo $text."\n";
The contents of each () group are stored in the variables $1, $2 ... $n, so I don't have to type stuff over again. The match can also be made more generic to match different kinds of URL syntax if needed.
Seeing this solution, you might want to rethink the way you plan to match keywords in your code. :)
output:
<a href="keyword.php">link text with keyword and stuff</a>
<a href="keyword.php"><b>link text with keyword and stuff</b></a>
