I would like to take a block of code stored in a variable and replace the src of any image tags in there without disturbing the rest of the code block.
For example : the block of code might read :
<img src="image1.jpg">
I would like to change that to (using PHP) :
<img src="altimage.jpg">
I am currently using a solution I found using the PHP DOM module to change the image tag but the function returns just the changed img tag HTML without the rest of the HTML.
The function I am calling is as follows :
function replace_img_src($original_img_tag, $new_src_url) {
$doc = new DOMDocument();
$doc->loadHTML($original_img_tag);
$tags = $doc->getElementsByTagName('img');
if(count($tags) > 0)
{
$tag = $tags->item(0);
$tag->setAttribute('src', $new_src_url);
return $doc->saveXML($tag);
}
return false;
}
This, of course, just returns the changed img tag but strips the other HTML (such as the A tag) - I am passing the entire block of code to the function.
(BTW - It's good for me to have the false return for no image tags as well).
What am I missing here please ?
Many thanks in advance for any help.
You need to use return $doc->saveXML(); instead of return $doc->saveXML($tag);. See the documentation of saveXML:
saveXML ([ DOMNode $node [, int $options ]] )
node: Use this parameter to output only a specific node without XML declaration rather than the entire document.
Related
I am trying to modify (replace) with DomDocument the attribute of the new CSS class of the a tag (framed on the following screenshot in green) which I have checked well with the function stripos and try to replace its href attribute (i.e. replace the href attribute of the a tag with new CSS class) with str_ireplace:
My following PHP Code looks correct but doesn't work at all:
libxml_use_internal_errors(true);
$parser = new DOMDocument();
$parser->loadHTMLFile("https://fr.wikipedia.org/wiki/Alibaba_Group");
$get_a_tags = $parser->getElementsByTagName("a");
foreach ($get_a_tags as $get_a_tag) {
if (stripos($get_a_tag->getAttribute('class'), "new") !== false) {
$get_href_in_a_infobox = $get_a_tag->getAttribute('href');
$term = $get_a_tag->nodeValue;
$urlSearch = BASE_PATH."search.php?term=$term&type=sites";
// var_dump($urlSearch."<br><br>");
$wikipediaInfoboxTable = str_ireplace($get_href_in_a_infobox, $urlSearch, $wikipediaInfoboxTable);
}
}
So, how to succeed in replacing the value of the href attribute of all the a tags having the new CSS class by the value of the $urlSearch variable knowing that $wikipediaInfoboxTable contains the entire table of the infobox in which I would like to make my replacement ???
Thank you please help me.
Since you're using DomDocument you can try using the setAttribute method to change the href value:
foreach ($get_a_tags as $get_a_tag) {
//...
$get_a_tag->setAttribute('href', $urlSearch);
}
$newHtml = $parser->saveHTML();
I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it.
Basically I would do
Get content as HTML string
Remove all image tags from content
Limit content to x words
Output.
Any help?
There is no dedicated methods for removing elements. You just find all the img elements and then do
$e->outertext = '';
when you only delete the outer text you delete the HTML content itself, but if you perform another find on the same elements it will appear in the result. the reason is that the simple HTML DOM object still has it's internal structure of the element, only without its actual content. what you need to do in order to really delete the element is simply reload the HTML as string to the same variable. this way the object will be recreated without the deleted content, and the simple HTML DOM object will be built without it.
here is an example function:
public function removeNode($selector)
{
foreach ($this->find($selector) as $node)
{
$node->outertext = '';
}
$this->load($this->save());
}
put this function inside the simple_html_dom class and you're good.
I think you have some difficulties because you forgot to save(dump the internal DOM tree back into string).
Try this:
$html = file_get_html("http://example.com");
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
echo $html;
I could not figure out where to put the function so I just put the following directly in my code:
$html->load($html->save());
It basically locks changes made in the for loop back into the html per above.
The supposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition.
I prefer to use "soft deletes":
foreach($html->find('somecondition'),$item){
if (somecheck) $item->setAttribute('softDelete', true); //<= set marker to check in further code
$item->outertext='';
foreach($foo as $bar){
if(!baz->getAttribute('softDelete'){
//do something
}
}
}
This is working for me:
foreach($html->find('element') as $element){
$element = NULL;
}
Adding new answer since removeNode is definitely a better way of removing it:
$html->removeNode('img');
This method probably was not available when accepted answer was marked. You do not need to loop the html to find each one, this will remove them.
Use outerhtml instead of outertext
<div id='your_div'>the contents of your div</div>
$your_div->outertext = '';
echo $your_div // echoes <div id='your_div'></div>
$your_div->outerhtml= '';
echo $your_div // echoes nothing
Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
$element->delete();
}
This works now:
$element->remove();
You can see the documentation for the method here.
Below I remove the HEADER and all SCRIPT nodes of the incoming url by using 2 different methods of the FIND() function. Remove the 2nd parameter to return an array of all matching nodes then just loop through the nodes.
$clean_html = file_get_html($url);
// Find and remove 1st instance of node.
$node = $clean_html->find('header', 0);
$node->remove();
// Find and remove all instances of Nde.
$nodes = $clean_html->find('script');
foreach($nodes as $node) {
$node->remove();
}
I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it.
Basically I would do
Get content as HTML string
Remove all image tags from content
Limit content to x words
Output.
Any help?
There is no dedicated methods for removing elements. You just find all the img elements and then do
$e->outertext = '';
when you only delete the outer text you delete the HTML content itself, but if you perform another find on the same elements it will appear in the result. the reason is that the simple HTML DOM object still has it's internal structure of the element, only without its actual content. what you need to do in order to really delete the element is simply reload the HTML as string to the same variable. this way the object will be recreated without the deleted content, and the simple HTML DOM object will be built without it.
here is an example function:
public function removeNode($selector)
{
foreach ($this->find($selector) as $node)
{
$node->outertext = '';
}
$this->load($this->save());
}
put this function inside the simple_html_dom class and you're good.
I think you have some difficulties because you forgot to save(dump the internal DOM tree back into string).
Try this:
$html = file_get_html("http://example.com");
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
echo $html;
I could not figure out where to put the function so I just put the following directly in my code:
$html->load($html->save());
It basically locks changes made in the for loop back into the html per above.
The supposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition.
I prefer to use "soft deletes":
foreach($html->find('somecondition'),$item){
if (somecheck) $item->setAttribute('softDelete', true); //<= set marker to check in further code
$item->outertext='';
foreach($foo as $bar){
if(!baz->getAttribute('softDelete'){
//do something
}
}
}
This is working for me:
foreach($html->find('element') as $element){
$element = NULL;
}
Adding new answer since removeNode is definitely a better way of removing it:
$html->removeNode('img');
This method probably was not available when accepted answer was marked. You do not need to loop the html to find each one, this will remove them.
Use outerhtml instead of outertext
<div id='your_div'>the contents of your div</div>
$your_div->outertext = '';
echo $your_div // echoes <div id='your_div'></div>
$your_div->outerhtml= '';
echo $your_div // echoes nothing
Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
$element->delete();
}
This works now:
$element->remove();
You can see the documentation for the method here.
Below I remove the HEADER and all SCRIPT nodes of the incoming url by using 2 different methods of the FIND() function. Remove the 2nd parameter to return an array of all matching nodes then just loop through the nodes.
$clean_html = file_get_html($url);
// Find and remove 1st instance of node.
$node = $clean_html->find('header', 0);
$node->remove();
// Find and remove all instances of Nde.
$nodes = $clean_html->find('script');
foreach($nodes as $node) {
$node->remove();
}
I am trying to read all links with in a given url.
here is code I am using :
$dom = new DomDocument();
#$dom->loadHTMLFile($url);
$urls = $dom->getElementsByTagName('a');
foreach ($urls as $url) {
echo $url->innertext ." => ".$url->getAttribute('href');
Script giving all links of given url.
But problem here is I am not able to get image links (image inside anchor tag)
First I tried with
$url->nodeValue
But it was giving anchor text having text values only.
I want to read both images and text links.
I want output in below formmat.
Input :
first link
<img src="imageone.jpg">
Current Output:
first link => link1.php
=>link2.php with warning (Undefined property: DOMElement::$innertext )
Required Output :
first link => link1.php
<img src="imageone.jpg">=>link2.php
innerText doesn't exist in PHP; it's a non-standard, Javascript extension to the DOM.
I think what you want is effectively an innerHTML property. There isn't a native way of achieving this. You can use the saveXML or, from PHP 5.3.6, saveHTML methods to export the HTML of each of the child nodes:
function innerHTML($node) {
$ret = '';
foreach ($node->childNodes as $node) {
$ret .= $node->ownerDocument->saveHTML($node);
}
return $ret;
}
Note that you'll need to use saveXML before PHP 5.3.6
You could then call it as so:
echo innerHTML($url) ." => ".$url->getAttribute('href');
I have a little issue as my preg_match_all is not running properly.
what I want to do is extract the src parameter of all the images in the post_content from the wordpress which is a string - not a complete html document/DOM (thus cannot use a document parser function)
I am currently using the below code which is unfortunately too untidy and works for only 1 image src, where I want all image sources from that string
preg_match_all( '/src="([^"]*)"/', $search->post_content, $matches);
if ( isset( $matches ) )
{
foreach ($matches as $match)
{
if(strpos($match[0], "src")!==false)
{
$res = explode("\"", $match[0]);
echo $res[1];
}
}
}
can someone please help here...
Using regular expressions to parse an HTML document can be very error prone. Like in your case where not only IMG elements have an SRC attribute (in fact, that doesn’t even need to be an HTML attribute at all). Besides that, it also might be possible that the attribute value is not enclosed in double quote.
Better use a HTML DOM parser like PHP’s DOMDocument and its methods:
$doc = new DOMDocument();
$doc->loadHTML($search->post_content);
foreach ($doc->getElementsByTagName('img') as $img) {
if ($img->hasAttribute('src')) {
echo $img->getAttribute('src');
}
}
You can use a DOM parser with HTML strings, it is not necessary to have a complete HTML document. http://simplehtmldom.sourceforge.net/