How to match the first image element without preceding text? - php

I need to select the first image tag in an HTML string, but only if no text precedes it. For example, it should match this:
<p><span><img src="some.jpg"></span></p>
But it should not match this:
<p>Text text text<span><img src="some.jpg"></span></p>
nor this:
<p><span>Text text text<img src="some.jpg"></span></p>
I've tried something like:
/(<[^>]+>)<img/is
So that I can select the tags before the img tag, but I'm not able to exclude the text that can be in any tag preceding the img element.
Any thoughts?

Regex solution:
$regex='#^(<[^>]+>)*<img#i';
var_dump(preg_match($regex,'<p><span><img src="some.jpg"></span></p>'));
var_dump(preg_match($regex,'<p>Text text text<span><img src="some.jpg"></span></p>'));
var_dump(preg_match($regex,'<p><span>Text text text<img src="some.jpg"></span></p>'));
Outputs:
int(1)
int(0)
int(0)
Edit:
DOM/XPath solution:
foreach(array('<p><span><img src="some.jpg"></span></p>',
'<p>Text text text<span><img src="some.jpg"></span></p>',
'<p><span>Text text text<img src="some.jpg"></span></p>') as $html)
{
$dom=new DOMDocument();
$dom->loadHTML($html);
$xpath=new DOMXPath($dom);
var_dump($xpath->query('//img[string-length(//text())<=0]')->length);
}
Also outputs 1,0,0.
Edit #2: The XPath solution still works, but it also rejects the case where text comes after the <img>. Since the question implies that "preceding" is meant literally, I think regex is the better tool here.
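If a DOM-based check is still preferred, one possible way around the limitation described in Edit #2 is the XPath preceding axis, which only looks at nodes that come before the image. The following is a minimal sketch, not part of the original answer; the function name firstImageSrcWithoutPrecedingText is hypothetical, and whitespace-only text nodes are assumed not to count as text.

```php
<?php
// Hypothetical helper: returns the src of the first <img> if no
// non-whitespace text node precedes it in the document, else null.
function firstImageSrcWithoutPrecedingText(string $html): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // (//img)[1] picks the first image in document order; the predicate
    // rejects it when any non-blank text node appears before it.
    $nodes = $xpath->query('(//img)[1][not(preceding::text()[normalize-space()])]');

    return $nodes->length > 0 ? $nodes->item(0)->getAttribute('src') : null;
}

var_dump(firstImageSrcWithoutPrecedingText('<p><span><img src="some.jpg"></span></p>'));
var_dump(firstImageSrcWithoutPrecedingText('<p>Text text text<span><img src="some.jpg"></span></p>'));
var_dump(firstImageSrcWithoutPrecedingText('<p><span><img src="some.jpg">Text after</span></p>'));
```

Unlike the string-length(//text()) query, text that follows the image does not prevent a match here.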

Maybe like this:
$str = '
<p><span><img src="some1.jpg"></span></p>
<p><span>Text text text<img src="some2.jpg"></span></p>
<p><span>Text text text<img src="some3.jpg"></span></p>
<p><span><img src="some4.jpg"></span></p>';
preg_match_all('#<p>\s*<span>\s*<a.*(<img[^>]+>)#U', $str, $match);
echo '<pre>' . htmlspecialchars(print_r($match, 1)) . '</pre>';

$content = strip_tags($yourContent, '<p><img>');
preg_match_all("#<p>(<img[^>]+>)#U", $content, $out);
print_r($out);

Related

PHP: Replace DOMElement with DOMText node

I want to create some customised tags for translating, for instance
<trad>SOMETHING</trad>
I've also got a file with some $GLOBALS variable, like:
$GLOBALS['SOMETHING'] = 'Some text';
$GLOBALS['SOMETHINGELSE'] = 'Some other text';
So I've been able to show my translation in this way:
$string = "<trad>SOMETHING</trad>";
$string = preg_replace('/<trad[^>]*?>([\\s\\S]*?)<\/trad>/','\\1', $string);
echo $GLOBALS[$string];
This works perfectly, but when I have something more complex like the following code, or when this tag occurs more than once, I can't get it to work:
$string = "Lorem ipsum <trad>SOMETHING</trad> <h1>Hello</h1> <trad>SOMETHINGELSE</trad>";
Ideally I want to build a new $string variable in which each tag is replaced by its translated value, so that I can display it with a simple echo.
So I want an output like this with:
echo $string;
//output: Lorem ipsum Some text <h1>Hello</h1> Some other text
Can you guys help me?
Regex is not a reliable approach for processing an HTML string. Here we use DOMDocument instead of regex to get the desired output. The final strip_tags call is only needed because the input is not a complete HTML document; if a valid HTML string is supplied to loadHTML, saveHTML($node) will do the job on its own.
Try this code snippet:
<?php
ini_set('display_errors', 1);
libxml_use_internal_errors(true);
$array["SOMETHING"]="some text";
$array["SOMETHINGELSE"]="some text other";
$string = "Lorem ipsum <trad>SOMETHING</trad> <h1>Hello</h1> <trad>SOMETHINGELSE</trad>";
$domDocument = new DOMDocument();
$domDocument->loadHTML($string,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$results=$domDocument->getElementsByTagName("trad");
do
{
foreach($results as $result)
{
$result->parentNode->replaceChild($domDocument->createTextNode($array[trim($result->nodeValue)]),$result);
}
}
while($results->length>0);
echo strip_tags($domDocument->saveHTML(),"<h1>");
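For comparison, the regex from the question can also be made to handle several tags by switching to preg_replace_callback. This is only a sketch under the assumption that <trad> tags are never nested; the function name translateTradTags and the $translations array are hypothetical stand-ins for the $GLOBALS entries in the question.

```php
<?php
// Hypothetical lookup table standing in for the $GLOBALS entries.
$translations = [
    'SOMETHING'     => 'Some text',
    'SOMETHINGELSE' => 'Some other text',
];

// Replaces every <trad>KEY</trad> with its translation.
// Unknown keys are left untouched so missing translations stay visible.
function translateTradTags(string $input, array $translations): string
{
    $result = preg_replace_callback(
        '/<trad[^>]*>(.*?)<\/trad>/s',
        function (array $match) use ($translations): string {
            $key = trim($match[1]);
            return $translations[$key] ?? $match[0];
        },
        $input
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $input;
}

$string = "Lorem ipsum <trad>SOMETHING</trad> <h1>Hello</h1> <trad>SOMETHINGELSE</trad>";
echo translateTradTags($string, $translations);
// Lorem ipsum Some text <h1>Hello</h1> Some other text
```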

Regular expression to remove links with their inner text from a string with PHP

I have the following code:
$string = 'Try to remove the link text from the content links in it Try to remove the link text from the content testme Try to remove the link text from the content';
$string = preg_replace('#(<a.*?>).*?(</a>)#', '$1$2', $string);
$result = preg_replace('/<a href="(.*?)">(.*?)<\/a>/', "\\2", $string);
echo $result; // this will output "I am a lot of text with links in it";
I am looking to merge these preg_replace lines. Please suggest.
You need to use DOM for these tasks. Here is a sample that removes links from this content of yours:
$str = 'Try to remove the link text from the content links in it Try to remove the link text from the content testme Try to remove the link text from the content';
$dom = new DOMDocument;
@$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$links = $xp->query('//a');
foreach ($links as $link) {
$link->parentNode->removeChild($link);
}
echo preg_replace('/^<p>([^<>]*)<\/p>$/', '$1', @$dom->saveHTML());
Since the text node is the only one in the document, the PHP DOM creates a dummy p node to wrap the text, so I am using a preg_replace to remove it. I think it is not your case.
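An alternative to stripping the dummy p node with a regex is to wrap the input in a container element and serialize only that container's children. The sketch below assumes this approach; the function name removeLinks and the wrapper id are hypothetical, and the sample input is made up because the links in the original string were lost.

```php
<?php
// Hypothetical helper: removes every <a> element (including its text)
// and returns the remaining HTML fragment.
function removeLinks(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    // The wrapper div gives the fragment a known parent to read back from.
    $dom->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Copy the result list to an array before modifying the tree.
    foreach (iterator_to_array($xpath->query('//a')) as $link) {
        $link->parentNode->removeChild($link);
    }

    // Serialize only the wrapper's children, not the wrapper itself.
    $wrapper = $xpath->query('//div[@id="wrapper"]')->item(0);
    $output = '';
    foreach ($wrapper->childNodes as $child) {
        $output .= $dom->saveHTML($child);
    }

    return $output;
}

echo removeLinks('Keep this <a href="#">drop this</a> and this');
```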

Remove from string

I have the following that I need removed from string in loop.
<comment>Some comment here</comment>
The result comes from a database, so the content inside the comment tag varies.
Thanks for the help.
Figured it out. The following seems to do the trick.
echo preg_replace('~\<comment>.*?\</comment>~', '', $blog->comment);
This may be overkill, but you can use DOMDocument to parse the string as HTML, then remove the tags.
$str = 'Test 123 <comment>Some comment here</comment> abc 456';
$dom = new DOMDocument;
// Wrap $str in a div, so we can easily extract the HTML from the DOMDocument
@$dom->loadHTML("<div id='string'>$str</div>"); // It yells about <comment> not being valid
$comments = $dom->getElementsByTagName('comment');
foreach($comments as $c){
$c->parentNode->removeChild($c);
}
$domXPath = new DOMXPath($dom);
// $dom->getElementById requires the HTML be valid, and it's not here
// $dom->saveHTML() adds a DOCTYPE and HTML tag, which we don't need
echo $domXPath->query('//div[@id="string"]')->item(0)->nodeValue; // "Test 123 abc 456"
DEMO: http://codepad.org/wfzsmpAW
If this is only a matter of removing the <comment /> tag, a simple preg_replace() or a str_replace() will do:
$input = "<comment>Some comment here</comment>";
// Probably the best method str_replace()
echo str_replace(array("<comment>","</comment>"), "", $input);
// some comment here
// Or by regular expression...
echo preg_replace("/<\/?comment>/", "", $input);
// some comment here
Or if there are other tags in there and you want to strip out all but a few, use strip_tags() with its optional second parameter to specify allowable tags.
echo strip_tags($input, "<a><p><other_allowed_tag>");
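Note that the str_replace and preg_replace examples above remove only the tags and keep the text between them, whereas the accepted regex removes the text as well. The sketch below contrasts the two; the sample string and variable names are made up for illustration, and the s modifier is an addition so that comments spanning several lines are also matched.

```php
<?php
// Made-up sample input with a comment that spans two lines.
$input = "Test 123 <comment>Some\ncomment here</comment> abc 456";

// Removes the tags AND their content; the s modifier lets . match newlines.
$withoutComment = preg_replace('~<comment>.*?</comment>~s', '', $input);

// Removes only the opening and closing tags, keeping the inner text.
$tagsOnly = preg_replace('~</?comment>~', '', $input);

var_dump($withoutComment); // "Test 123  abc 456"
var_dump($tagsOnly);       // "Test 123 Some\ncomment here abc 456"
```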

Using PHP to remove a html element from a string

I am having trouble working out how to do this, I have a string looks something like this...
$text = "<p>This is some example text This is some example text This is some example text</p>
<p><em>This is some example text This is some example text This is some example text</em></p>
<p>This is some example text This is some example text This is some example text</p>";
I basically want to use something like preg_replace and regex to remove
<em>This is some example text This is some example text This is some example text</em>
So I need to write some PHP code that will search for the opening <em> and closing </em> and delete all text in-between
hope someone can help,
Thanks.
$text = preg_replace('/([\s\S]*)(<em>)([\s\S]*)(<\/em>)([\s\S]*)/', '$1$5', $text);
In case you are interested in a non-regex solution, the following would work as well:
<?php
$text = "<p>This is some example text This is some example text This is some example text</p>
<p><em>This is some example text This is some example text This is some example text</em></p>
<p>This is some example text This is some example text This is some example text</p>";
$emStartPos = strpos($text,"<em>");
$emEndPos = strpos($text,"</em>");
if ($emStartPos !== false && $emEndPos !== false) {
$emEndPos += 5; // include the closing </em> tag (5 characters)
$len = $emEndPos - $emStartPos;
$text = substr_replace($text, '', $emStartPos, $len);
}
?>
This will remove all the content in between tags.
$text = '<p>This is some example text This is some example text This is some example text</p>
<p><em>This is the em text</em></p>
<p>This is some example text This is some example text This is some example text</p>';
preg_match("#<em>(.+?)</em>#", $text, $output);
echo $output[0]; // This will output it with em style
echo '<br /><br />';
echo $output[1]; // This will output only the text between the em
For this example to work, I changed the <em></em> contents a little, otherwise all your text is the same and you cannot really understand if the script works.
However, if you want to get rid of the <em> and not to get the contents:
$text = '<p>This is some example text This is some example text This is some example text</p>
<p><em>This is the em text</em></p>
<p>This is some example text This is some example text This is some example text</p>';
echo preg_replace("/<em>(.+)<\/em>/", "", $text);
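One caveat with the greedy (.+) above: if the string contains more than one <em> element, everything from the first <em> to the last </em> is removed, including the text in between. A lazy quantifier avoids that. The short sketch below uses a made-up input with two <em> elements to show the difference; the s modifier is an addition so that elements spanning several lines are matched too.

```php
<?php
// Made-up input containing two separate <em> elements.
$text = '<p>a <em>one</em> b <em>two</em> c</p>';

// Greedy: matches from the first <em> to the LAST </em>, so " b " is lost.
$greedy = preg_replace('/<em>(.+)<\/em>/', '', $text);

// Lazy: each <em>...</em> pair is matched and removed on its own.
$lazy = preg_replace('/<em>.*?<\/em>/s', '', $text);

var_dump($greedy); // "<p>a  c</p>"
var_dump($lazy);   // "<p>a  b  c</p>"
```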
Use strpos to find the first opening tag and strrpos to find the last closing tag.
Use substr to get that part of the string.
Then replace that substring with an empty string in the original string.
format: $text = str_replace('<em>','',$text);
$text = str_replace('</em>','',$text);

How to strip tags in PHP using regex?

$string = 'text <span style="color:#f09;">text</span>
<span class="data" data-url="http://www.google.com">google.com</span>
text <span class="data" data-url="http://www.yahoo.com">yahoo.com</span> text.';
What I want to do is get the data-url from all spans with the class data. So, it should output:
$string = 'text <span style="color:#f09;">text</span>
http://www.google.com text http://www.yahoo.com text.';
And then I want to remove all the remaining html tags.
$string = strip_tags($string);
Output:
$string = 'text text http://www.google.com text http://www.yahoo.com text.';
Can someone please tell me how this can be done?
If your string contains more than just the HTML snippet you show, you should use DOM with this XPath
//span/@data-url
Example:
$dom = new DOMDocument;
$dom->loadHTML($string);
$xp = new DOMXPath($dom);
foreach( $xp->query('//span/@data-url') as $node ) {
echo $node->nodeValue, PHP_EOL;
}
The above would output
http://www.google.com
http://www.yahoo.com
When you already have the HTML loaded, you can also do
echo $dom->documentElement->textContent;
which returns the same result as strip_tags($string) in this case:
text text
google.com
text yahoo.com text.
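To get the single string the question asks for, the two steps above can be combined: replace each span of class data with a text node holding its data-url, then read the text content. This is a sketch under the assumption that the class attribute is exactly "data"; the function name replaceDataSpans is hypothetical.

```php
<?php
// Hypothetical helper: swaps every <span class="data" data-url="...">
// for its data-url value and returns the document's plain text.
function replaceDataSpans(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $spans = $xpath->query('//span[@class="data"][@data-url]');
    // Copy the result list to an array before modifying the tree.
    foreach (iterator_to_array($spans) as $span) {
        $url = $dom->createTextNode($span->getAttribute('data-url'));
        $span->parentNode->replaceChild($url, $span);
    }

    // textContent drops all remaining tags, much like strip_tags().
    return trim($dom->documentElement->textContent);
}

$string = 'text <span style="color:#f09;">text</span>
<span class="data" data-url="http://www.google.com">google.com</span>
text <span class="data" data-url="http://www.yahoo.com">yahoo.com</span> text.';

echo replaceDataSpans($string);
```

The line breaks of the input are preserved in the result; collapse them with preg_replace('/\s+/', ' ', ...) if a single line is needed.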
Try to use SimpleXML and foreach by the elements - then check if class attribute is valid and grab the data-url's
preg_match_all("/data\" data-url=\"([^\"]*)/i", $string, $urls);
You can fetch all the URLs this way.
And you can also use simplexml as hsz mentioned
The short answer is: don't. There's a lovely rant somewhere around SO explaining why parsing html with regexes is a bad idea. Essentially it boils down to 'html is not a regular language so regular expressions are not adequate to parse it'. What you need is something DOM aware.
As @hsz said, SimpleXML is a good option if you know that your HTML validates as XML. DOMDocument::loadHTML might be better, since it doesn't require well-formed HTML. Once your HTML is in a DOMDocument object, you can extract whatever you need very easily. Check out the docs.
