domdocument regex replace tag - php

Hello, I have some PHP regex code like this:
preg_replace('~<div\s*.*?(?:\s*class\s*=\s*"(.*?)"|id\s*=\s*"(.*?)\s*)?>~i','<div align="center" class="$1" id="$2">', "html source code");
What I want to do is rewrite every div tag in the HTML source so that it keeps only its class and id attributes and gains align="center":
Examples:
<div style="border:none;" class="classbutton"> should become <div align="center" class="classbutton">
<div style="border:none;" class="classbutton" id="idstyle"> should become <div align="center" class="classbutton" id="idstyle">
I have already tried many PHP regex variants, but nothing seems to work. Could someone help me, or give me DOMDocument code that fixes this?
Thanks in advance.

Here is a snippet that should get you going:
$html = '<body><div style="border:none;" class="classbutton" id="idstyle">Some text</div></body>'; // Sample HTML string
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div[@class="classbutton"]'); // Get all DIV tags with class "classbutton"
foreach($divs as $div) { // Loop through all DIVs found
    $div->setAttribute('align', 'center'); // Set align="center"
    $div->removeAttribute('style'); // Remove "style" attribute
}
echo $dom->saveHTML(); // Save HTML (use $html = $dom->saveHTML();)
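The snippet above only touches divs with class "classbutton" and only removes style. A minimal sketch of the more general request (keep just class and id on every div, then add align="center") could look like the following. The function name normalizeDivs is hypothetical, and the attribute order in the output may differ from the examples in the question, which should not matter to a browser.

```php
<?php
// Hypothetical helper: keep only class/id on every <div> and add align="center".
function normalizeDivs(string $html): string
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('div') as $div) {
        // Copy the attribute names first: removing entries while iterating
        // the live attribute map could skip some of them.
        $names = [];
        foreach ($div->attributes as $attr) {
            $names[] = $attr->nodeName;
        }
        foreach ($names as $name) {
            if ($name !== 'class' && $name !== 'id') {
                $div->removeAttribute($name);
            }
        }
        $div->setAttribute('align', 'center');
    }

    return $dom->saveHTML();
}

$result = normalizeDivs(
    '<body><div style="border:none;" class="classbutton" id="idstyle">Some text</div></body>'
);
echo $result;
```

As in the answer, LIBXML_HTML_NOIMPLIED expects a single root element, which is why the sample string is wrapped in body.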

Related

appendXML stripping out img element

I need to insert an image with a div element in the middle of an article. The page is generated using PHP from a CRM. I have a routine to count the characters for all the paragraph tags, and insert the HTML after the paragraph that has the 120th character. I am using appendXML and it works, until I try to insert an image element.
When I put the <img> element in, it is stripped out. I understand it is looking for XML, however, I am closing the <img> tag which I understood would help.
Is there a way to use appendXML and not strip out the img elements?
$mcustomHTML = '<div style="position:relative; overflow:hidden;"><img src="https://s3.amazonaws.com/a.example.com/image.png" alt="No image" /></img></div>';
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="utf-8" ?>' . $content);
// read all <p> tags and count the text until reach character 120
// then add the custom html into current node
$pTags = $doc->getElementsByTagName('p');
foreach($pTags as $tag) {
$characterCounter += strlen($tag->nodeValue);
if($characterCounter > 120) {
// this is the desired node, so put html code here
$template = $doc->createDocumentFragment();
$template->appendXML($mcustomHTML);
$tag->appendChild($template);
break;
}
}
return $doc->saveHTML();
This should work for you. It uses a temporary DOM document to convert the HTML string that you have into something workable. Then we import the contents of the temporary document into the main one. Once it's imported we can simply append it like any other node.
<?php
$mcustomHTML = '<div style="position:relative; overflow:hidden;"><img src="https://s3.amazonaws.com/a.example.com/image.png" alt="No image" /></div>';
$customDoc = new DOMDocument();
$customDoc->loadHTML($mcustomHTML, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$doc = new DOMDocument();
$doc->loadHTML($content);
$customImport = $doc->importNode($customDoc->documentElement, true);
// read all <p> tags and count the text until reach character 120
// then add the custom html into current node
$characterCounter = 0; // initialise the running total before the loop
$pTags = $doc->getElementsByTagName('p');
foreach($pTags as $tag) {
    $characterCounter += strlen($tag->nodeValue);
    if($characterCounter > 120) {
        // this is the desired node, so put html code here
        $tag->appendChild($customImport);
        break;
    }
}
return $doc->saveHTML();

Removing attributes from HTML Tags with PHP

How do I remove a table's attributes such as height, border-spacing and style=""? I want to go from this:
<table style="border-collapse: collapse" border="0" bordercolor="#000000" cellpadding="3" cellspacing="0" height="80" width="95%">
to this -->
<table>
strip_tags works for ripping tags off, but what about preg_replace?
FYI: loading stuff from database and it has all these weird styling and I want to get rid of them.
If you really want to use preg_replace, this is the way to go, but keep in mind that preg_replace isn't reliable for parsing HTML:
$output = preg_replace('/(<[^>]+) style=".*?"/i', '$1', $html);
I recommend using PHP's DOM extension, which exists for this kind of operation:
// load HTML into a new DOMDocument
$dom = new DOMDocument;
$dom->loadHTML($html);
// Find style attribute with Xpath
$xpath = new DOMXPath($dom);
$styleNodes = $xpath->query('//*[@style]');
// Iterate over nodes and remove style attributes
foreach ($styleNodes as $styleNode) {
    $styleNode->removeAttribute('style');
}
// Save the clean HTML into $output
$output = $dom->saveHTML();
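The answer above removes only style. Since the question asks for a bare table tag, here is a rough sketch that strips every attribute from each table element. It assumes the attribute map returned by the attributes property is live, so it keeps removing the first remaining entry until none are left; the shortened sample markup is made up for illustration.

```php
<?php
// Sample markup (shortened from the question) used only for illustration.
$html = '<table style="border-collapse: collapse" border="0" cellpadding="3" height="80" width="95%">'
      . '<tr><td>cell</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('table') as $table) {
    // The attribute map shrinks as attributes are removed,
    // so always take whatever is currently first.
    while ($table->attributes->length > 0) {
        $table->removeAttribute($table->attributes->item(0)->nodeName);
    }
}

$output = $dom->saveHTML();
echo $output;
```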

How to fetch data (text) from an external website with PHP if possible?

I'm trying to extract data (text) from an external site and put it on my site.
I want to get football scores of an external site and put it on mine.
I've researched this and found that I can do it using preg_match, but I just can't figure out how to extract data from within HTML tags.
For example
this is the HTML structure of an external site.
<td valign="top" align="center" class="s1"><b>Text I Want To Fetch</b></td>
How would I fetch the text within the tags? It would help me out a lot! Thanks!
You can get the content of a web page with the file_get_contents function.
Eg:
$content = file_get_contents('http://www.source.com/page.html');
Try this:
<?php
$html = '<td valign="top" align="center" class="s1"><b>Text I Want To Fetch</b></td>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$dom = $dom->getElementsByTagName('td'); //find td
$dom = $dom->item(0); //traverse the first td
$dom = $dom->getElementsByTagName('b'); //find b
$dom = $dom->item(0); //traverse the first b
$dom = $dom->textContent; //get text
var_dump($dom); //dump it, echo, or print
Output
In this example there isn't any other text content, so if your HTML only has text within the bold tag, you may use this as well:
<?php
$html = '<td valign="top" align="center" class="s1"><b>Text I Want To Fetch</b></td>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$dom = $dom->textContent;
var_dump($dom);
Output
If you're talking about using PHP to fetch the data, then file_get_contents(url) may help; however, you can also fetch data with an AJAX request using jQuery. Here is the link to the AJAX documentation:
http://api.jquery.com/jquery.ajax/
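If the page has many cells, an XPath query can pick out just the ones with the class from the question. This is only a sketch: the markup is a literal string here (in practice it might come from file_get_contents), and wrapping the cell in a table is an assumption about the surrounding page.

```php
<?php
// Stand-in for the fetched page; a real page would be loaded from a URL.
$html = '<table><tr>'
      . '<td valign="top" align="center" class="s1"><b>Text I Want To Fetch</b></td>'
      . '</tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Collect the text of every <b> directly inside a <td class="s1">.
$texts = [];
foreach ($xpath->query('//td[@class="s1"]/b') as $b) {
    $texts[] = trim($b->textContent);
}

print_r($texts);
```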

XPath keeps returning empty node list

I am trying to parse a folder full of .htm files. All these files contain 1 specific element that needs to be removed.
It's a td element with class="hide". So far, this is my code.
$dir . $entry is the full path to the file.
$page = ($dir . $entry);
$this->domDoc->loadHTMLFile($page);
// Use xpath query to find the menu and remove it
$nodeList = $xpath->query('//td[@class="hide"]');
Unfortunately, this is where things already go wrong. If I do a var_dump of the node list, I get the following:
object(DOMNodeList)#5 (0) { }
Just so you folks get an idea of what I'm trying to select, here's an excerpt:
<td width="160" align="left" valign="top" class="hide">
lots of other TD's and content here
</td>
Does anybody see anything wrong with what I've come up with so far?
Is your initial file XHTML (i.e. with <html xmlns="http://www.w3.org/1999/xhtml">)? If so, your elements will be namespaced, and you'll need to set up a prefix mapping using $xpath->registerNamespace and then use this prefix in the expression:
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$nodeList = $xpath->query('//xhtml:td[@class="hide"]');
Var dumping an xpath node list object doesn't show anything. Var dump the node list's length.
var_dump($nodeList->length);
If the value is over 0, then you can iterate over it using foreach:
foreach($nodeList as $node)var_dump($node->tagName);
Hope this helps.
For further clarification, here is a full working code snippet:
<?php
$html = <<<END
<html>
<body>
<td>
</td>
<td class="hide"></td>
<td class="hide"></td>
</body>
</html>
END;
$dom = new DOMDocument;
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);
$nodeList = $xpath->query('//td[@class="hide"]');
// Shows a blank object
var_dump($nodeList);
// Shows 2
var_dump($nodeList->length);
// Echo out all the tag names.
foreach($nodeList as $node){
echo $node->tagName . "\n";
}
?>
Maybe you have more than one class in the class attribute of your td element:
<td class="hide anotherclass">
So '//td[@class="hide"]' would only match:
<td class="hide">
Try it like this to see if it contains the hide class you are looking for:
$nodeList = $xpath->query('//td[contains(@class,"hide")]');
Check out this blog post: XPath: Select element by class
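One caveat with a plain contains() test is that it matches substrings, so a class such as "hide-mobile" would also be selected. A common XPath 1.0 idiom pads the attribute with spaces and looks for the whole token. The sketch below compares both queries on made-up markup:

```php
<?php
// Made-up cells: an exact class, a multi-class cell, and a look-alike class.
$html = '<table><tr>'
      . '<td class="hide">a</td>'
      . '<td class="hide other">b</td>'
      . '<td class="hide-mobile">c</td>'
      . '</tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Substring match: also catches "hide-mobile".
$loose = $xpath->query('//td[contains(@class, "hide")]');

// Token match: pad the class list with spaces and look for " hide ".
$exact = $xpath->query(
    '//td[contains(concat(" ", normalize-space(@class), " "), " hide ")]'
);

echo $loose->length, "\n"; // 3
echo $exact->length, "\n"; // 2
```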

How to read the <strong> text and the link url using DOMdocument?

I have this html:
<a href=" URL TO KEEP" class="class_to_check">
<strong> TEXT TO KEEP</strong>
</a>
I have a long HTML document with many links like the one above. I need to keep only the links that have a <strong> inside, and for each of them keep the href of the link and the text inside the <strong>. How can I do this using DOMDocument?
Thank you!
$html = "...";
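$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom); // the class is DOMXPath, not XPath
$a = $xp->query('//a')->item(0); // first link in the document
$href = $a->getAttribute('href');
$strong = $a->nodeValue; // text content of the link, i.e. the <strong> text here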
Of course, this XPath stuff works for just this particular html snippet. You'll have to adjust it to work with a more fully populated HTML tree.
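For a document with many links, one possible adjustment is to select only anchors that contain a strong element and collect both values in a loop. This is a sketch with made-up URLs and markup; the trim() calls are there because the sample in the question has leading spaces in both the href and the text.

```php
<?php
// Made-up markup: one link with <strong> and one without.
$html = '<div>'
      . '<a href=" https://example.com/keep" class="class_to_check"><strong> TEXT TO KEEP</strong></a>'
      . '<a href="https://example.com/skip">no strong here</a>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$links = [];
// Select only <a> elements that have a <strong> somewhere inside them.
foreach ($xpath->query('//a[.//strong]') as $a) {
    // Query relative to the current link to get its first <strong>.
    $strong = $xpath->query('.//strong', $a)->item(0);
    $links[] = [
        'href' => trim($a->getAttribute('href')),
        'text' => trim($strong->textContent),
    ];
}

print_r($links);
```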
