PHP regex to add class to images without class

I'm looking for a PHP regex that checks whether an image has no class and, if so, adds "img-responsive" as that image's class.
Thank you.

Instead of looking to implement a regular expression, make effective use of DOM instead.
$doc = new DOMDocument;
$doc->loadHTML($html); // load the HTML data
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
if (!$img->hasAttribute('class'))
$img->setAttribute('class', 'img-responsive');
}
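If you also need the modified markup back as a string, the loop above can be wrapped in a function. This is only a minimal sketch: the name addResponsiveClass is made up for illustration, and it assumes the input is an HTML fragment that DOMDocument can parse (non-ASCII content may need extra encoding handling).

```php
<?php
// Hypothetical helper; the name is an assumption, not part of the question.
function addResponsiveClass(string $html): string
{
    if (trim($html) === '') {
        return $html; // nothing to parse
    }

    $doc = new DOMDocument();
    // Collect warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        if (!$img->hasAttribute('class')) {
            $img->setAttribute('class', 'img-responsive');
        }
    }

    // loadHTML() wraps a fragment in <html><body>; return only the body's children.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

$out = addResponsiveClass('<p><img src="a.png"><img class="thumb" src="b.png"></p>');
echo $out;
```

Images that already carry a class attribute, such as the second one in the sample, are left untouched.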

I would be tempted to do this in jQuery, which offers all the functionality you need in a few lines.
$(document).ready(function(){
$('img').not('img[class]').each(function(e){
$(this).addClass('img-responsive');
});
});

If you have the output in PHP, then an HTML parser is the way to do it. Regular expressions will always fail in the end. If you don't want to use a parser but you have the HTML code, you can try to do it with plain and simple PHP code:
function addClassToImagesWithout($html)
// this function does what you want, given well-formed html
{
    // cut into parts where the images are
    $parts = explode('<img', $html);
    foreach ($parts as $key => $part)
    {
        // the first part is the text before the first image, so skip it
        if ($key === 0) continue;
        // split at the end of tags, the image attributes are in the first bit
        $bits = explode('>', $part);
        // does it not contain a class?
        if (strpos($bits[0], 'class=') === FALSE)
        {
            // insert the class
            $bits[0] .= " class='img-responsive'";
        }
        // recombine the bits and store them back in the array
        $parts[$key] = implode('>', $bits);
    }
    // recombine the parts and return the html
    return implode('<img', $parts);
}
This code is untested and far from perfect, but it shows that regular expressions are not needed. You will have to add some code to handle edge cases.
I must stress that this code, just like regular expressions, will ultimately fail when, for instance, you have something like id='classroom', title='we are a class apart' or similar. To do a better job you should use a parser:
http://htmlparsing.com/php.html


PHP: Simple HTML DOM Parser - how to get the element which has certain tag name?

In PHP I'm using the Simple HTML DOM Parser class.
I have an HTML file which has multiple different tags.
In this HTML there is an element like this:
<a name="10418"><b> Hospitalist (Family Practitioner)</b></a>
So I would like to find the 'a' element which has name="10418".
I've tried this with no luck; I only want to get that string.
$html_object = str_get_html($url);
$html_object=$html_object->find('a');
foreach ($html_object as $o) {
$a= $o->find("b");
echo $a[0];
}
Try:
$anchor = $html_object->find('a[name=10418]', 0);
echo $anchor->plaintext;
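If you would rather not depend on a third-party library, a similar lookup can be sketched with PHP's built-in DOMDocument and DOMXPath. This swaps in a different technique from the Simple HTML DOM call above; the sample markup and variable names are assumptions for illustration.

```php
<?php
// Sample markup based on the element from the question.
$html = '<p><a name="10418"><b> Hospitalist (Family Practitioner)</b></a></p>';

$doc = new DOMDocument();
// Collect warnings about imperfect markup instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select the first <a> element whose name attribute equals 10418.
$xpath = new DOMXPath($doc);
$anchor = $xpath->query('//a[@name="10418"]')->item(0);

// textContent concatenates the text of all descendants, including the <b>.
$title = $anchor !== null ? trim($anchor->textContent) : null;
echo $title;
```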
Try another library called Tag Parse. It's simple and efficient.
$dom = new TagParse\TagDomRoot($html);
$a = $dom->find('a[name="10418"]');
I think it's faster and uses less memory than simple_html_dom.

Remove tags with Simple HTML DOM parser [duplicate]

I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it.
Basically I would do
Get content as HTML string
Remove all image tags from content
Limit content to x words
Output.
Any help?
There are no dedicated methods for removing elements. You just find all the img elements and then do
$e->outertext = '';
When you only clear the outer text, you delete the HTML content itself, but if you perform another find for the same elements, they will still appear in the result. The reason is that the Simple HTML DOM object still holds its internal structure for the element, only without its actual content. To really delete the element, reload the HTML as a string into the same variable. That way the Simple HTML DOM object is rebuilt without the deleted content.
Here is an example function:
public function removeNode($selector)
{
foreach ($this->find($selector) as $node)
{
$node->outertext = '';
}
$this->load($this->save());
}
Put this function inside the simple_html_dom class and you're good.
I think you have some difficulties because you forgot to save (dump the internal DOM tree back into a string).
Try this:
$html = file_get_html("http://example.com");
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
echo $html;
I could not figure out where to put the function so I just put the following directly in my code:
$html->load($html->save());
It basically locks changes made in the for loop back into the html per above.
The supposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition.
I prefer to use "soft deletes":
foreach ($html->find('somecondition') as $item) {
    if (somecheck) {
        $item->setAttribute('softDelete', true); // set marker to check in further code
    }
    $item->outertext = '';
    foreach ($foo as $bar) {
        if (!$bar->getAttribute('softDelete')) {
            // do something
        }
    }
}
This is working for me:
foreach($html->find('element') as $element){
$element = NULL;
}
Adding new answer since removeNode is definitely a better way of removing it:
$html->removeNode('img');
This method probably was not available when accepted answer was marked. You do not need to loop the html to find each one, this will remove them.
Use outerhtml instead of outertext:
<div id='your_div'>the contents of your div</div>
$your_div->outertext = '';
echo $your_div; // echoes <div id='your_div'></div>
$your_div->outerhtml = '';
echo $your_div; // echoes nothing
Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
$element->delete();
}
This works now:
$element->remove();
Below I remove the HEADER and all SCRIPT nodes of the incoming url by using 2 different methods of the FIND() function. Remove the 2nd parameter to return an array of all matching nodes then just loop through the nodes.
$clean_html = file_get_html($url);
// Find and remove 1st instance of node.
$node = $clean_html->find('header', 0);
$node->remove();
// Find and remove all instances of node.
$nodes = $clean_html->find('script');
foreach($nodes as $node) {
$node->remove();
}
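For comparison, the whole task from the question (take an HTML string, drop the images, limit the text to x words) can also be sketched with PHP's built-in DOMDocument instead of Simple HTML DOM. The function names removeImages and limitWords are made up for illustration, and the sketch assumes an HTML fragment that DOMDocument can parse.

```php
<?php
// Hypothetical helper: returns the fragment with all <img> elements removed.
function removeImages(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so copy it before removing nodes.
    foreach (iterator_to_array($doc->getElementsByTagName('img')) as $img) {
        $img->parentNode->removeChild($img);
    }

    // loadHTML() wraps a fragment in <html><body>; return only the body's children.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

// Hypothetical helper: keeps at most $maxWords whitespace-separated words.
function limitWords(string $text, int $maxWords): string
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_slice($words, 0, max(0, $maxWords)));
}

$article = '<p>Breaking <img src="a.png" alt="photo"> news from the <b>front</b> page today</p>';
$clean = removeImages($article);
$snippet = limitWords(strip_tags($clean), 4);
echo $snippet;
```

Because the nodes are really removed from the tree here, no reload step is needed before querying the document again.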

Ignore html tags in preg_replace

How do I ignore HTML tags in this preg_replace?
I have a foreach loop for a search, so if someone searches for "apple span", the preg_replace also wraps a span around the span tag itself and the HTML breaks:
preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);
Thanks in advance!
I think you should base your function on DOMDocument and DOMXPath rather than on regular expressions. Even though regular expressions are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with them.
The general saying is: Don't parse HTML with regular expressions.
It's a good rule to keep in mind. As with any rule, it does not always apply, but it's worth making up your mind about it.
XPath allows you to find all texts that contain the search terms within text only, ignoring all XML elements.
Then you only need to wrap those texts into the <span> and you're done.
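For the simpler case where each match sits inside a single text node, the idea can be sketched as follows. This is a reduced illustration, not the full routine: the function name highlightKeyword is made up, it assumes the mbstring extension and an HTML fragment that DOMDocument can parse, and it does not find matches that span several tags.

```php
<?php
// Hypothetical helper: wraps each occurrence of $keyword found inside a text node.
function highlightKeyword(string $html, string $keyword): string
{
    if (trim($html) === '' || $keyword === '') {
        return $html;
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);
    // Copy the text nodes first, because the loop below modifies the tree.
    $textNodes = iterator_to_array($xp->query('//body//text()'));
    $length = mb_strlen($keyword, 'UTF-8');

    foreach ($textNodes as $node) {
        // Only text is searched, so tag names such as "span" never match.
        while (($pos = mb_stripos($node->nodeValue, $keyword, 0, 'UTF-8')) !== false) {
            $match = $node->splitText($pos);     // $match now starts at the keyword
            $node = $match->splitText($length);  // $node is the text after the keyword
            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $match->parentNode->replaceChild($span, $match);
            $span->appendChild($match);
        }
    }

    // loadHTML() wraps a fragment in <html><body>; return only the body's children.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

$out = highlightKeyword('<p>An apple <span>and a span</span></p>', 'span');
echo $out;
```

Only the word "span" in the text is wrapped; the existing span element itself is left alone.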
Edit: Finally some code ;)
First it makes use of xpath to locate elements that contain the search text. My query looks like this, this might be written better, I'm not a super xpath pro:
'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'
$search contains the text to search for, not containing any " (quote) character (this would break it, see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).
This query will return all parents that contain textnodes which put together will be a string that contain your search term.
As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful to do string-operations on a list of textnodes as if they were one string.
This is the base skeleton of the routine:
$str = '...'; # some XML
$search = 'text that span';
printf("Searching for: (%d) '%s'\n", strlen($search), $search);
$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
throw new Exception('Anchor element not found.');
}
// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
throw new Exception('XPath failed.');
}
// process search results
foreach($r as $i => $node)
{
$textNodes = $xp->query('.//child::text()', $node);
// extract $search textnode ranges, create fitting nodes if necessary
$range = new TextRange($textNodes);
$ranges = array();
while(FALSE !== $start = strpos($range, $search))
{
$base = $range->split($start);
$range = $base->split(strlen($search));
$ranges[] = $base;
};
// wrap each matching text node
foreach($ranges as $range)
{
foreach($range->getNodes() as $node)
{
$span = $doc->createElement('span');
$span->setAttribute('class', 'search_hightlight');
$node = $node->parentNode->replaceChild($span, $node);
$span->appendChild($node);
}
}
}
For my example XML:
<html>
<body>
This is some <span>text</span> that span across a page to search in.
and more text that span</body>
</html>
It produces the following result:
<html>
<body>
This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
and more <span class="search_hightlight">text that span</span></body>
</html>
This shows that this even allows finding text that is distributed across multiple tags. That is not easily possible with regular expressions at all.
You find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class that I have taken out of the answer's example).
It's not working properly on the viper codepad because of an older LIBXML version that site is using. It works fine for my LIBXML version 20707. I created a related question about this issue: XPath query result order.
A note of warning: This example uses binary string search (strpos) and the related offsets for splitting text nodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.
The example works anyway because it only makes use of US-ASCII, which has the same offsets as UTF-8 for the example data.
For a real-life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:
while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
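The difference between the two offsets is easy to see on a string that contains multibyte characters. A small sketch, assuming the mbstring extension is available; the sample text is made up for illustration.

```php
<?php
// "héllo wörld" written with escapes so the bytes are UTF-8 regardless of file encoding.
$text   = "h\u{e9}llo w\u{f6}rld";
$needle = "w\u{f6}rld";

// strpos() counts bytes: "é" occupies two bytes in UTF-8.
$byteOffset = strpos($text, $needle);

// mb_strpos() counts characters, which is what DOMText::splitText expects.
$charOffset = mb_strpos($text, $needle, 0, 'UTF-8');

printf("byte offset: %d, character offset: %d\n", $byteOffset, $charOffset);
// prints: byte offset: 7, character offset: 6
```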

Extract Image Sources from text in PHP - preg_match_all required

I have a little issue: my preg_match_all is not working properly.
What I want to do is extract the src attribute of all the images in the post_content from WordPress, which is a string and not a complete HTML document/DOM (so I cannot use a document parser function).
I am currently using the code below, which is unfortunately untidy and works for only one image src, whereas I want all image sources from that string:
preg_match_all( '/src="([^"]*)"/', $search->post_content, $matches);
if ( isset( $matches ) )
{
foreach ($matches as $match)
{
if(strpos($match[0], "src")!==false)
{
$res = explode("\"", $match[0]);
echo $res[1];
}
}
}
Can someone please help here?
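A note on why the loop above prints only one source: with the default flags, preg_match_all fills $matches[0] with all full matches and $matches[1] with all values of the first capture group, so iterating over $matches and reading $match[0] only ever sees the first hit of each list. A minimal sketch of reading the capture group directly follows; the sample string and the stricter pattern (anchored to img tags, double-quoted values only) are assumptions for illustration.

```php
<?php
// Sample content standing in for $search->post_content.
$content = '<p><img src="a.png"> some text <img alt="x" src="b.jpg"></p>';

// Match src="..." only inside <img ...> tags; group 1 captures the value.
preg_match_all('/<img\b[^>]*?\bsrc="([^"]*)"/i', $content, $matches);

// $matches[1] holds every captured src value, in document order.
$sources = $matches[1];

foreach ($sources as $src) {
    echo $src, "\n";
}
```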
Using regular expressions to parse an HTML document can be very error-prone. In your case, for example, IMG elements are not the only ones that have an SRC attribute (in fact, the match doesn't even need to be an HTML attribute at all). Besides that, the attribute value might not be enclosed in double quotes.
Better use an HTML DOM parser like PHP's DOMDocument and its methods:
$doc = new DOMDocument();
$doc->loadHTML($search->post_content);
foreach ($doc->getElementsByTagName('img') as $img) {
if ($img->hasAttribute('src')) {
echo $img->getAttribute('src');
}
}
You can use a DOM parser with HTML strings, it is not necessary to have a complete HTML document. http://simplehtmldom.sourceforge.net/
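The same holds for the built-in DOMDocument: loadHTML accepts a fragment and fills in the missing html and body elements itself. A minimal sketch that collects all sources into an array; the function name extractImageSources is made up for illustration.

```php
<?php
// Hypothetical helper: returns the src of every <img> in an HTML fragment.
function extractImageSources(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Collect warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $sources[] = $img->getAttribute('src');
        }
    }
    return $sources;
}

// Works with single-quoted and unquoted attribute values as well.
$sources = extractImageSources("<p><img src='a.png'> text <img src=b.jpg> <a src=\"c\">x</a></p>");
print_r($sources);
```

Unlike the plain src="..." pattern, this ignores src attributes on elements other than img.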
