I'm new to both HTML & PHP. I'm attempting to remove multiple DOM elements from a parsed HTML string.
For example:
<tbody>
<tr>
<td>I'd like to find and remove this text</td>
<td>& possibly this too</td>
<td>can you help?</td>
</tr>
</tbody>
Thanks in advance!
DOMDocument is much better for dealing with DOM manipulation (SimpleXML is good for just parsing):
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$textNodes = $xpath->query('//text()');
foreach ($textNodes as $node) {
$node->parentNode->removeChild($node);
}
Very simply if I were to remove lots of things from a string in PHP I would create an array of the parts to remove, loop through them and remove them one at a time from the big string.
Like
$html = "Loads of html blah blah blah blah......";
$array = array(
"Remove me",
"and me",
"remove me too"
);
foreach($array as $string){
str_replace($string, '', $html);
}
For more information on str_replace see http://php.net/manual/en/function.str-replace.php
Related
I have a part of HTML string like below which I get from web page scraping.
$scraping_html = "<html><body>
....
<div id='ctl00_ContentPlaceHolder1_lblHdr'>some text here with &. some text here.</div>
....</body></html>";
I want to take count of & between the particular div by using PHP. Is it possible to get using any of the PHP preg functions? Thanks in advance.
The hard part is getting the text nodes (I assume that's where you're stuck). Depending on how reliable it needs to be you have two alternatives (just sample code, not actually tested):
Good old strip_tags():
$plain_text = strip_tags($scraping_html);
Proper DOM parser:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($scraping_html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$plain_text = '';
foreach ($xpath->query('//text()') as $textNode) {
$plain_text .= $textNode->nodeValue;
}
To count, you have e.g. substr_count().
To get the number of & in the given example, use DOMDocument:
$html = <<<EOD
<html><body>
<div id='ctl00_ContentPlaceHolder1_lblHdr'>some text here with &. some text here.</div>
</body></html>
EOD;
$dom = new DOMDocument;
$dom->loadHTML($html);
$ele = $dom->getElementById('ctl00_ContentPlaceHolder1_lblHdr');
echo substr_count($ele->nodeValue, '&');
I need to remove some tags (e.g. <div></div>) in HTML document and keep inner tags and text.
I managed to do that with Simple HTML Dom Parser. But it can't process big files due to huge memory requirements.
I would prefer to use native PHP tools like DOMDocument cause I read that it's more optimized and quicker in processing HTML documents.
But I struggle at the first stage - how to remove some tags while keeping inner text and tags.
Source HTML sample is:
<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>
I try this code:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
foreach ($oldnodes as $node) {
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
}
echo $doc->saveHTML();
It produces the output:
<html><body>00000aaaaa<div>bbbbbbccc<a>link</a>cccdddddd</div>eeeee<div>1111</div></body></html>
I need the following:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
Could someone please help me with proper code for the task?
You can use strip_tags function in PHP.
$thmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
strip_tags($htmltext, '<html>,<body>,<a>');
This remove all tags except html,body,a
And output is:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
EDIT:
If it is input from user, it's better for security reason to use whitelist tags and not blacklist.
If your code only contains simple HTML tags without any attributes you can keep it simple like:
$value = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
$pattern = '/<[\/]*(div|h1)>/';
$removedTags = preg_replace($pattern, '', $value);
Since you wrote in your comment that there are more than just div tags you want to remove, I added a h1 tag to the pattern in case you also want to remove h1 tags.
This code snippet is only for simple code, but fits to your HTML input and output example.
Try this..
Just replace the for loop with the below code.
foreach ($oldnodes as $node) {
$children = $node->childNodes;
$string = "";
foreach($children as $child) {
$childString = $doc->saveXML($child);
$string = $string."".$childString;
}
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($string);
$node->parentNode->insertBefore($fragment,$node);
$node->parentNode->removeChild($node);
}
I found a way to make it work.
The reason code in question not working is the manipulation with nodes in nodelist ruin nodelist. So "foreach" function wents through only 2 out of 4 items in nodelist - the rest 2 become distorted.
So I had to deal with only the 1st element of the list and then rebuild list until there are some items in the list left.
The code is:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo "<!--
".$htmltext."
-->
";
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
while ($oldnodes->length>0){
$node=$oldnodes->item(0);
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
$oldnodes = $doc->getElementsByTagName('div');
}
echo $doc->saveHTML();
I hope that will be helpful for someone who finds same difficulties.
I'm trying to write a document that will go through a webpage that was poorly coded and return the title element. However, the person who made the website I plan on scraping did not use ANY classes, simply just div elements. Heres the source webpage I'm trying to scrape:
<tbody>
<tr>
<td style = "...">
<div style = "...">
<div style = "...">TEXT I WANT</div>
</div>
</td>
</tr>
</tbody>
and when I copy the xpath in chrome I get this string:
/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]
I'm having trouble figuring out where I put that string in an xpath query.
If not an xpath query maybe I should do a preg_match?
I tried this:
$location = '/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]';
$html = file_get_contents($URL);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query($location) as $node) {
echo $node, "\n";
}
but nothing is printed to the page.
Thanks.
EDIT: Full sourse code here:
http://pastebin.com/K5tZ4dFH
EDIT2: Cleaner code screen shot: http://i.imgur.com/lWKheBy.png
From looking at your source, try the following:
$html = file_get_contents($URL);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[contains(#style, 'left:20px')]");
foreach ($nodes as $node) {
echo $node->textContent;
}
It looks like you want the text just before the first </div>, so this regex will find that:
[^<>]+(?=<\/div>)
Here's a live demo.
Using regex in PHP how can I get the 108 from this tag?
<td class="registration">108</td>
Regex isn't a good solution for parsing HTML. Use a DOM Parser instead:
$str = '<td class="registration">108</td>';
$dom = new DOMDocument();
$dom->loadHTML($str);
$tds = $dom->getElementsByTagName('td');
foreach($tds as $td) {
echo $td->nodeValue;
}
Output:
108
Demo!
The above code loads up your HTML string using loadHTML() method, finds all the the <td> tags, loops through the tags, and then echoes the node value.
If you want to get only the specific class name, you can use an XPath:
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DomXPath($dom);
// get the td tag with 'registration' class
$tds = $xpath->query("//*[contains(#class, 'registration')]");
foreach($tds as $td) {
echo $td->nodeValue;
}
Demo!
This is similar to the above code, except that it uses XPath to find the required tag. You can find more information about XPaths in the PHP manual documentation. This post should get you started.
If you wish to force regex, use the <td class=["']?registration["']?>(.*)</td> expression
hey guys,
a curl function returns a string $widget that contains regular html -> two divs where the first div holds a table with various values inside of <td>'s.
i wonder what's the easiest and best way for me to extract only all the values inside of the <td>'s so i have blank values without the remaining html.
any idea what the pattern for the preg_match should look like?
thank you.
Regex is not a suitable solution. You're better off loading it up in a DOMDocument and parsing it.
You're betting off using a DOM parser for that task:
$html = <<<HTML
<div>
<table>
<tr>
<td>foo</td>
<td>bar</td>
</tr>
<tr>
<td>hello</td>
<td>world</td>
</tr>
</table>
</div>
<div>
Something irrelevant
</div>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tds = $xpath->query('//div/table/tr/td');
foreach ($tds as $cell) {
echo "{$cell->textContent}\n";
}
Would output:
foo
bar
hello
world
You shouldn't use regexps to parse HTML. Use DOM and XPath instead. Here's an example:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td');
$result = array();
foreach ($nodes as $node) {
$result[] = $node->nodeValue;
}
// $result holds the values of the tds
Only if you have very limited, well-defined HTML can you expect to parse it with regular expressions. The highest ranked SO answer of all time addresses this issue.
He comes ...