Getting content of partial html in DomDocument - php

I have a string:
$string = 'some text <img src="www">';
I want to get the image source and the text.
Here is what I have:
$doc= new DOMDocument();
$doc->loadHTML($string);
$nodes=$doc->getElementsByTagName ('img');
From $nodes->item(0) I get the image source.
How can I get the the "some text"?

textContent, or with DOMXPaths $xpath->query('//text()')

For simple cases like this, try:
$doc->documentElement->textContent

You could make it like jQuery in javascript. Wrap the whole string with anything, and get this. Then you can get the TextNode, which contains this text.
$string = 'some text <img src="www">';
$string = '<div id="wrapper">' . $string . '</div>';
$nodes = $doc->getElementById('wrapper');

Related

PHP: Replace DOMElement with DOMText node

I want to create some customised tags for translating, for instance
<trad>SOMETHING</trad>
I've also got a file with some $GLOBALS variable, like:
$GLOBALS['SOMETHING'] = 'Some text';
$GLOBALS['SOMETHINGELSE'] = 'Some other text';
So I've been able to show my translation in this way:
$string = "<trad>SOMETHING</trad>";
$string = preg_replace('/<trad[^>]*?>([\\s\\S]*?)<\/trad>/','\\1', $string);
echo $GLOBALS[$string];
This works perfectly, but when I've got something more complex like the following code, or when I have more occurences of this tag, I'm not able to let it work:
$string = "Lorem ipsum <trad>SOMETHING</trad> <h1>Hello</h1> <trad>SOMETHINGELSE</trad>";
I ideally want to create a new variale $string, replacing the values that I found into my tags and being able to show it with a simple echo.
So I want an output like this with:
echo $string;
//output: Lorem ipsum Some text <h1>Hello</h1> Some other text
Can you guys help me?
Regex is not a valid approach for treating HTMLstring. Here we are using DOMDocument instead of Regex to achieve desired output. The last step of strip_tags has been done to achieve desired output, there will no need in case a valid HTML string is supplied to loadHTML, in that case saveHTML($node) will do the job.
Try this code snippet here
<?php
ini_set('display_errors', 1);
libxml_use_internal_errors(true);
$array["SOMETHING"]="some text";
$array["SOMETHINGELSE"]="some text other";
$string = "Lorem ipsum <trad>SOMETHING</trad> <h1>Hello</h1> <trad>SOMETHINGELSE</trad>";
$domDocument = new DOMDocument();
$domDocument->loadHTML($string,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$results=$domDocument->getElementsByTagName("trad");
do
{
foreach($results as $result)
{
$result->parentNode->replaceChild($domDocument->createTextNode($array[trim($result->nodeValue)]),$result);
}
}
while($results->length>0);
echo strip_tags($domDocument->saveHTML(),"<h1>");

Regular expression to remove links with their inner text from a string with PHP

I have the following code:
$string = 'Try to remove the link text from the content links in it Try to remove the link text from the content testme Try to remove the link text from the content';
$string = preg_replace('#(<a.*?>).*?(</a>)#', '$1$2', $string);
$result = preg_replace('/<a href="(.*?)">(.*?)<\/a>/', "\\2", $string);
echo $result; // this will output "I am a lot of text with links in it";
I am looking to merge these preg_replace lines. Please suggest.
You need to use DOM for these tasks. Here is a sample that removes links from this content of yours:
$str = 'Try to remove the link text from the content links in it Try to remove the link text from the content testme Try to remove the link text from the content';
$dom = new DOMDocument;
#$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$links = $xp->query('//a');
foreach ($links as $link) {
$link->parentNode->removeChild($link);
}
echo preg_replace('/^<p>([^<>]*)<\/p>$/', '$1', #$dom->saveHTML());
Since the text node is the only one in the document, the PHP DOM creates a dummy p node to wrap the text, so I am using a preg_replace to remove it. I think it is not your case.
See IDEONE demo

How to strip tags in PHP using regex?

$string = 'text <span style="color:#f09;">text</span>
<span class="data" data-url="http://www.google.com">google.com</span>
text <span class="data" data-url="http://www.yahoo.com">yahoo.com</span> text.';
What I want to do is get the data-url from all spans with the class data. So, it should output:
$string = 'text <span style="color:#f09;">text</span>
http://www.google.com text http://www.yahoo.com text.';
And then I want to remove all the remaining html tags.
$string = strip_tags($string);
Output:
$string = 'text text http://www.google.com text http://www.yahoo.com text.';
Can someone please tell me how this can be done?
If your string contains more than just the HTML snippet you show, you should use DOM with this XPath
//span/#data-url
Example:
$dom = new DOMDocument;
$dom->loadHTML($string);
$xp = new DOMXPath($dom);
foreach( $xp->query('//span/#data-url') as $node ) {
echo $node->nodeValue, PHP_EOL;
}
The above would output
http://www.google.com
http://www.yahoo.com
When you already have the HTML loaded, you can also do
echo $dom->documentElement->textContent;
which returns the same result as strip_tags($string) in this case:
text text
google.com
text yahoo.com text.
Try to use SimpleXML and foreach by the elements - then check if class attribute is valid and grab the data-url's
preg_match_all("/data/" data-url=/"([^']*)/i", $string , $urls);
You can fetch all URls a=by this way.
And you can also use simplexml as hsz mentioned
The short answer is: don't. There's a lovely rant somewhere around SO explaining why parsing html with regexes is a bad idea. Essentially it boils down to 'html is not a regular language so regular expressions are not adequate to parse it'. What you need is something DOM aware.
As #hsz said, SimpleXML is a good option if you know that your html validates as XML. Better might be DOMDocument::loadHTML which doesn't require well-formed html. Once your html is in a DOMDocument object then you can extract what you will very easily. Check out the docs here.

Using regex to remove HTML tags

I need to convert
$text = 'We had <i>fun</i>. Look at this photo of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.
First do a preg_replace to keep the link. You could use:
preg_replace('(.*?)', '$\2 ($\1)', $str);
Then use strip_tags which will finish off the rest of the tags.
try an xml parser to replace any tag with it's inner html and the a tags with its href attribute.
http://www.php.net/manual/en/book.domxml.php
The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[#href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $node) {
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load any HTML into a DomDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node element from the innerHTML and the href attribute and replaces the link. The second version just uses the DOM API and no Xpath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.
I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = replace($text, "<i>", "");
$text = replace($text, "</i>", "");
(My php is really rusty, so replace may not be the right function name -- but the idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
$start = strrpos( $text, "<a" );
$end = strrpos( $text, "</a>", $start );
$text = substr( $text, $start, $end );
$text = replace($text, "</a>", "");
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php
It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at this photo of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the code you want in your test case.

What is the best way to strip out all html tags from a string?

Using PHP, given a string such as: this is a <strong>string</strong>; I need a function to strip out ALL html tags so that the output is: this is a string. Any ideas? Thanks in advance.
PHP has a built-in function that does exactly what you want: strip_tags
$text = '<b>Hello</b> World';
print strip_tags($text); // outputs Hello World
If you expect broken HTML, you are going to need to load it into a DOM parser and then extract the text.
What about using strip_tags, which should do just the job ?
For instance (quoting the doc) :
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
will give you :
Test paragraph. Other text
Edit : but note that strip_tags doesn't validate what you give it. Which means that this code :
$text = "this is <10 a test";
var_dump(strip_tags($text));
Will get you :
string 'this is ' (length=8)
(Everything after the thing that looks like a starting tag gets removed).
strip_tags is the function you're after. You'd use it something like this
$text = '<strong>Strong</strong>';
$text = strip_tags($text);
// Now $text = 'Strong'
I find this to be a little more effective than strip_tags() alone, since strip_tags() will not zap javascript or css:
$search = array(
"'<head[^>]*?>.*?</head>'si",
"'<script[^>]*?>.*?</script>'si",
"'<style[^>]*?>.*?</style>'si",
);
$replace = array("","","");
$text = strip_tags(preg_replace($search, $replace, $html));

Categories