Get entire HTML, not just text with Goutte - php

I'm parsing a website and have run into a problem: some of its text is split up with <br>, but when I use $node->text(), there isn't even a space in place of the <br>.
How can I get the <br> too, or at least replace it with a space?
The HTML is something like this:
<span>Some<br>Text</span>
Currently I get SomeText and I want it to be Some Text.
Thanks!

With Goutte you can use the html() method.
$node->html();
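It will include the <br> though. Calling strip_tags on that directly would drop the <br> without leaving a space, so replace the <br> with a space first and then strip the remaining HTML tags.
$text = strip_tags(str_replace('<br>', ' ', $node->html()));
There is probably a built-in way of doing this with Goutte.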

You can retrieve the HTML for that node instead of the text, and replace the <br> tags with spaces yourself. Something like this should do just fine:
str_replace('<br>', ' ', strip_tags($node->html(), '<br>'));
The strip_tags is there to remove anything that's not <br>, so it would be the equivalent of the text() method, but allow the line break tags. Then they can be replaced with spaces using str_replace. The above will transform this:
<span>Some<br>Text</span>
into this
Some Text
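
The str_replace call above only matches the literal <br>, so spellings such as <br/> or <BR /> would slip through. Here is a minimal sketch that handles those variants as well. It assumes the input string is whatever $node->html() returned; the helper name brToSpaceText is hypothetical and not part of Goutte.

```php
<?php
// Hypothetical helper: turns the inner HTML of a node into plain text,
// putting a space wherever a <br> (in any common spelling) used to be.
function brToSpaceText(string $html): string
{
    // Replace <br>, <br/>, <br /> and <BR> with a single space.
    $withSpaces = preg_replace('#<br\s*/?>#i', ' ', $html);
    // Drop every remaining tag, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($withSpaces), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace left behind by the replacement.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Stand-in strings for $node->html(); with Goutte you would pass that value instead.
echo brToSpaceText('Some<br>Text'), "\n";
echo brToSpaceText('Some<br />more <b>bold</b><BR>text'), "\n";
```

The regular expression is used instead of str_replace so that one call covers every spelling of the tag.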

Related

Stuck with strip_tags: why is it not working in my code?

I want to remove quotes and HTML tags from the following string:
$mystring='Example string "I want to remove any html tags ABC<sub>DE</sub> from <p>similar</p> types of string?"';
I am using the following script to remove them, but it doesn't work for me:
echo strip_tags(htmlentities($mystring,ENT_QUOTES));
I want the following output for the above string:
Example string "I want to remove any html tags ABCDE from similar types of string?"
Why is strip_tags not working? What am I missing here?
Once you use htmlentities() on that string there are no tags left to strip because they have been converted, well, to their HTML entities.
Use strip_tags first to remove the HTML tags, then apply htmlentities. The following should work:
echo htmlentities(strip_tags($mystring),ENT_QUOTES);
You can use this code:
echo strip_tags($mystring);
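
To make the ordering issue concrete, here is a small sketch that runs both orders on the string from the question. The variable names $escapedFirst, $strippedFirst and $safeForHtml are made up for illustration.

```php
<?php
$mystring = 'Example string "I want to remove any html tags ABC<sub>DE</sub> from <p>similar</p> types of string?"';

// Escaping first turns "<" and ">" into &lt; and &gt;, so no tags remain to strip.
$escapedFirst = strip_tags(htmlentities($mystring, ENT_QUOTES));

// Stripping first removes the tags; the double quotes stay in the string.
$strippedFirst = strip_tags($mystring);

// Escape afterwards only if the result is going to be printed into an HTML page.
$safeForHtml = htmlentities($strippedFirst, ENT_QUOTES);

echo $escapedFirst, "\n";
echo $strippedFirst, "\n";
echo $safeForHtml, "\n";
```

When viewed in a browser, the first line shows the tags as literal text, which is why it looks as if strip_tags did nothing.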

remove <br> tag on ckeditor output

I have integrated a textarea box with CKEditor, and each time I press Enter on the backend for a new line, it outputs <br> on the front end. Is there a way to remove the <br> on the front end? I don't want the HTML tag output there.
My line of code looks like the following:
echo "<strong>Sites Linked Out To</strong>: " . $row->sites_linked_out_to;
Is there a way to remove the HTML <br> tag before it gets added to the database, or after?
Thank you in advance.
PHP's strip_tags function takes an optional second parameter listing the tags to keep; all other HTML tags are removed.
strip_tags($input, '<a><img><div><strong>');
More information on the strip_tags function: http://php.net/manual/tr/function.strip-tags.php
Be careful with nl2br() here: it does the opposite of what you want. It inserts a <br /> before each newline; it does not turn <br> into \n. To replace <br> tags with newlines when you insert the content, use a regular expression instead.
Note though: if you are actually seeing the tag as text, it's probably being HTML-escaped somewhere. Do the replacement before that escaping takes place and it should work.
You could also strip other undesired tags using strip_tags. Do this after replacing the line breaks.
define('ALLOWED_TAGS', '<p><strong><ul><li><ol><em>');
$sContent = strip_tags(preg_replace('#<br\s*/?>#i', "\n", $sContent), ALLOWED_TAGS);
Note though, strip_tags leaves the attributes of allowed tags untouched, so if <a> were in the allowed list this would pass through unchanged: <a href='#' onclick='DO_SOMETHING_BAD'>click me</a>
You could look at using a library such as HTML Purifier to sanitise input. Or just ensure you sanitise all output correctly.
See nl2br() and strip_tags for more info.
I know I am late, but this may help someone.
Use
htmlspecialchars_decode($your_string);
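
The points raised in this thread can be combined into one rough sketch: turn <br> into newlines, keep only a short list of tags, and decode first if the markup was stored escaped. The helper name cleanEditorOutput and its default tag list are assumptions for illustration, not anything CKEditor provides.

```php
<?php
// Hypothetical helper: convert <br> variants to newlines, then keep only allowed tags.
function cleanEditorOutput(string $html, string $allowedTags = '<p><strong><em>'): string
{
    $withNewlines = preg_replace('#<br\s*/?>#i', "\n", $html);
    return strip_tags($withNewlines, $allowedTags);
}

echo cleanEditorOutput('first line<br />second line<br>third'), "\n";

// Attributes on allowed tags survive strip_tags, so this onclick is NOT removed.
echo strip_tags("<a href='#' onclick='DO_SOMETHING_BAD'>click me</a>", '<a>'), "\n";

// If the tags are visible as text, they were stored escaped; decode before cleaning.
echo cleanEditorOutput(htmlspecialchars_decode('one&lt;br /&gt;two')), "\n";
```

The second echo is the reason a dedicated sanitiser is the safer choice for untrusted input: strip_tags filters tag names only.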

Replacing A String With a HTML Tag Using PHP

I'm reading in a piece of HTML text. I want to remove all HTML tags except paragraphs and headings. To do this I use str_replace to replace the tags that I want to keep with string placeholders, then strip the HTML tags, and finally replace the string placeholders with the original HTML code. This is where it is failing.
$Text = 'ManyENH3 different';
$updatedText = str_replace("ENH3", "</h3>", $Text);
The above code won't remove the ENH3 string. I have tried messing around, and it doesn't work when there is no space before or after the word. I tried using preg_replace and it returns a blank string.
You can try:
$updatedText = strip_tags($Text, '<p><h1><h2><h3><h4><h5><h6>');
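
As a quick illustration of the allow-list approach, here is a sketch on a made-up sample string (the markup in $Text is an assumption, not the asker's real input). No placeholders are needed because strip_tags keeps the listed tags itself.

```php
<?php
// Sample input; the markup is made up for illustration.
$Text = '<div><h3>Many</h3> <p>different <a href="#">links</a> and <em>styles</em></p></div>';

// Keep paragraphs and headings, drop every other tag (their text content stays).
$updatedText = strip_tags($Text, '<p><h1><h2><h3><h4><h5><h6>');

echo $updatedText, "\n";
```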

process text in html and reinsert to html structure

I want to grab text from HTML, process and change it, and reinsert it into that HTML code with PHP.
<p>This is my sentence <span>and more</span> also <strong>important</strong> part.</p>
What's the best method? Using preg_*? How can I reinsert my text into the HTML structure?
For example, I want to remove all runs of two or more spaces between words.
preg_replace('/\s+/', ' ', $myText);
But I want it applied only to the text of my HTML, not to HTML tags, attributes, etc.
Have a look at DOMDocument. It'll allow you to do some manipulation on your HTML.
http://www.php.net/manual/en/domdocument.loadhtml.php
EDIT
If you want to elaborate on exactly what you want to do with your HTML example, we might be able to provide a more specific answer :)
EDIT
To reflect the updated question: multiple spaces in HTML should collapse when rendered anyway, but if you want to remove them you could try the following:
$result = preg_replace_callback('/(?<=\>)[\w\s]+(?=\<)/', function ($match) {
    // preg_replace returns the text unchanged when it contains no whitespace run;
    // preg_filter would return null there and wipe out the fragment.
    return preg_replace('/\s+/', ' ', $match[0]);
}, $str);
I'm not a regex expert by any stretch, so I'm sure there's a more elegant way to do this, but this might work for you nonetheless: first do a preg_replace_callback and use lookarounds to grab any text fragments between end and start tags. Then pass each fragment through preg_replace to replace any run of whitespace with a single space. Note that [\w\s]+ only matches fragments made of word characters and whitespace, so text containing punctuation is left alone.
Hope this helps/works :)
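
The DOMDocument suggestion above can be sketched roughly as follows: load the fragment, visit only the text nodes, and collapse whitespace inside them. The function name collapseSpacesInText and the wrapper <div> are assumptions made for this example, and the sketch assumes ASCII input (other encodings need extra handling with loadHTML).

```php
<?php
// Hypothetical helper: collapse runs of whitespace in text nodes only.
function collapseSpacesInText(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // silence warnings on imperfect markup
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Visit text nodes only; tag names and attributes are never touched.
    foreach ($xpath->query('//text()') as $textNode) {
        $textNode->data = preg_replace('/\s+/', ' ', $textNode->data);
    }

    // Serialise the children of the wrapper <div> back to an HTML string.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo collapseSpacesInText('<p>This   is my  sentence <span>and    more</span> also <strong>important</strong>   part.</p>'), "\n";
```

Working on text nodes avoids the main weakness of the regex approach, which cannot reliably tell text apart from tags and attributes.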

Using Smarty to strip P tags from my HTML

I'm using this code {$entry.entry|strip_tags} to strip tags, however I would just like to strip <p> tags and not all HTML tags.
Can someone help?
Thank you
If you want to strip ONLY <p> tags, try a simple regular-expression replacement:
{$entry.entry|regex_replace:"/(<p>|<p [^>]*>|<\\/p>)/":""}
This will replace <p>, </p> and all <p many attributes> strings with an empty string.
Let me know if it works. I tested the regular expression in PHP, not directly in Smarty.
You can do this using the regex_replace modifier:
{$foo = '<p>hello world</p><p some-att="ribute">foo</p>'}
{$foo|regex_replace:'#<\s*/?\s*p(\s[^>]*)?>#i':' '|escape}
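
Since the first answer mentions testing the expression in PHP rather than Smarty, here is a small PHP sketch of the second pattern for checking what it matches. The sample string in $entry is made up, and this version replaces the tags with an empty string rather than a space.

```php
<?php
// Made-up sample; <pre> is included to show that only <p> tags are affected.
$entry = '<p>hello <b>world</b></p><p class="intro">foo</p><pre>kept</pre>';

// Matches <p>, </p> and <p ...attributes>, but not <pre> or <param>.
$withoutP = preg_replace('#<\s*/?\s*p(\s[^>]*)?>#i', '', $entry);

echo $withoutP, "\n";
```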
